Abstract


High-quality  online  services  demand  reliable  packet  delivery  at  the  network  layer.  However,  clear 
evidence documents the existence of compromised routers in ISP and enterprise networks, threatening
network availability and reliability. A compromised router can stealthily drop, modify, inject,
or  delay  packets  in  the  forwarding  path  to  launch  Denial-of-Service,  surveillance,  man-in-the-middle 
attacks, etc. Unfortunately, current networks fail to provide any assurance of data delivery in adversarial
environments, nor a reliable way to identify misbehaving routers that jeopardize packet
delivery.  Data-plane  fault  localization  serves  as  an  imperative  building  block  to  enhance  network 
availability  and  reliability,  since  it  localizes  faulty  links  of  misbehaving  routers,  enables  a  sender  to 
find  a  fault-free  path,  and  enforces  contractual  obligations  among  network  nodes.  Until  recently, 
however, the design of secure fault localization protocols has proven to be surprisingly elusive. Existing
fault localization protocols fail to achieve high security and efficiency, incur unacceptably long
detection delays, and require forwarding paths to be impractically long-lived. In this dissertation,
we present a suite of secure and efficient fault localization protocols exploring distinct dimensions in
the  design  space  of  fault  localization.  Our  key  idea  is  to  achieve  a  lower  bound  on  packet  forwarding 
correctness  via  fault  localization  by  limiting  the  amount  of  malicious  packet  drops/forgeries  at  the 
data  plane,  instead  of  perfectly  detecting  every  single  malicious  activity  which  tends  to  result  in 
high  overhead.  In  this  way,  we  trap  an  attacker  into  a  dilemma:  if  the  attacker  inflicts  damage 
worse than a threshold, it will be detected, which may lead to removal from the network; otherwise,
the damage is limited and thus a lower bound on data-plane packet delivery is achieved. This
design  principle  enables  the  construction  of  efficient  probabilistic  algorithms  and  the  derivation  of 
provable  performance  bounds.  Both  the  analytical  and  experimental  results  show  that  the  proposed 
protocols  outperform  prior  work  by  100  to  1000  times  regarding  efficiency  with  provable  security 
against  sophisticated  attackers. 

Chapter 1


Introduction 

1.1  What  is  Fault  Localization? 

Performance-sensitive  services,  such  as  cloud  computing,  and  mission-critical  networks,  such  as  the 
military  and  ISP  networks,  require  high  assurance  of  network  data  delivery.  However,  real-world 
incidents  [2,  7,  13,  41,  55,  87]  and  studies  [14,  21,  71,  97]  reveal  the  existence  of  compromised 
routers  in  ISP  and  enterprise  networks,  and  demonstrate  that  current  networks  are  surprisingly 
vulnerable to data-plane attacks. Also, in a 2010 worldwide security survey [1], 61% of network
operators  ranked  infrastructure  outages  due  to  misconfigured  network  equipment  such  as  routers 
as  the  No.  2  security  threat.  A  compromised  router  or  a  dishonest  transit  ISP  can  easily  drop, 
delay,  inject  or  modify  packets  on  the  forwarding  path  to  mount  Denial-of-Service,  surveillance, 
man-in-the-middle  attacks,  etc. 

Unfortunately,  current  networks  do  not  provide  any  assurance  of  data  delivery  in  the  presence 
of  misbehaving  routers,  and  lack  a  reliable  way  to  identify  misbehaving  routers  that  jeopardize 
packet  delivery.  For  example,  a  malicious  or  misconfigured  router  can  “correctly”  respond  to 
ping  or  traceroute  probes  while  corrupting  other  packets,  thus  cloaking  the  attacks  from  ping 
or  traceroute.  Yet  most  recent  network  diagnosis  protocols  are  not  designed  for  adversarial 
environments  and  can  be  evaded  by  adversaries  [48,  49,  99,  47].  Though  secure  end-to-end  path 
monitoring  [18,  36]  and  multi-path  routing  [33,  35,  54,  72,  90,  91,  95]  can  mitigate  data-plane 
attacks to some extent, they are proven to render poor performance guarantees [76, 97]: without
knowing  exactly  which  link  is  faulty,  a  source  node  may  need  to  explore  an  exponential  number 
of  paths  in  the  number  of  faulty  links  in  the  worst  case.  As  illustrated  in  Figure  1.1,  where  the 
default route from S to D is path (l7, 2, 3, 4), end-to-end monitoring only indicates whether the current
path is faulty without localizing a specific faulty link (if any) of a compromised or misconfigured
router on the path. In the worst case, S needs to explore 2^4 paths to find the path with no faulty
links, i.e., path (1, 2, 3, 4).

Data-plane  fault  localization  serves  as  a  promising  remedy  for  securing  data  delivery.  In  a 
nutshell,  a  fault  localization  protocol  monitors  data  forwarding  at  each  hop  and  localizes  abnormally 
high  packet  loss,  injection,  and/or  forgery  on  a  certain  link.  Fault  localization  provides  the  following 
vital  benefits. 


Intelligent  path  selection.  The  current  Internet  Protocol  (IP)  instantiates  a  best-effort  service 
model without indicating if, when, or where a packet is lost or corrupted during the packet transmission.
Though aiming to provide reliable packet transmission, TCP is an end-to-end protocol
which  only  detects  whether  or  not  a  packet  is  lost  on  an  end-to-end  path  but  not  exactly  where 
the  packet  is  lost.  In  contrast,  fault  localization  provides  accurate  feedback  about  link  quality  and 
forwarding behavior of transit routers in the path. In recently advocated edge-controlled or multi-path
routing protocols [94, 98, 96, 79], edge routers or source nodes can utilize such information
on  the  network  status  to  make  the  optimal  path  selection  and  adapt  to  adverse  network  conditions 
for  improved  reliability  and  quality  of  service. 

Figure 1.1: Exponential path exploration problem for end-to-end monitoring. Dotted links are
faulty  links  of  malicious  routers  (black  nodes). 

Accountability. Computer networks (such as the Internet and mesh networks) tend to represent
a contractual business in which a node pays its peers or providers for forwarding its packets. Because
current Internet protocols provide no information on the fate of transmitted packets, nodes cannot
detect failures of their peers or providers. Fault localization provides forwarding
accountability,  which  refers  to  the  ability  to  associate  a  certain  forwarding  behavior  to  a  specific 
node,  or  to  hold  a  specific  node  responsible  for  its  activities.  Forwarding  accountability  proves 
to  be  a  necessary  component  for  enforcing  contractual  obligations  between  participating  nodes 
in  a  contractual  networking  service,  as  demonstrated  by  Laskowski  and  Chuang  [56].  Intuitively, 
forwarding  accountability  can  assure  each  node  that  its  partners  are  indeed  fulfilling  the  service 
agreement  for  packet  forwarding. 


Fast  failure  recovery.  Fault  localization  enables  a  source  node  S  to  identify  a  faulty  link  of  a 
malicious  router  M  on  which  M  drops,  modifies,  or  injects  packets  during  packet  forwarding.  By 
integrating  the  fault  localization  mechanism  into  edge-controlled  routing,  a  source  node  can  avoid 
using  the  identified  faulty  links  when  selecting  routing  paths,  thus  eliminating  the  exponential 
path exploration problem as shown in Figure 1.1. Assuming Ω faulty links in the network, a benign
source node can identify and remove all faulty links and thus find a fault-free path after exploring
at most Ω paths (linear in Ω) in the worst case. Figure 1.2 depicts the interaction between fault
localization  and  routing  for  fast  failure  recovery. 
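
The gap between exponential and linear path exploration can be illustrated with a small simulation. The sketch below is only illustrative: the four-stage topology with two parallel links per stage, the link names, and both function names are assumptions made for this example and are not taken from Figure 1.1. It also assumes that each failed attempt with fault localization reveals exactly one faulty link on the attempted path.

```python
from itertools import product

def explore_end_to_end(stages, faulty):
    """Try paths one by one when only a pass/fail verdict per path is available."""
    tried = 0
    for path in product(*stages):  # one link chosen per stage
        tried += 1
        if not any(link in faulty for link in path):
            return path, tried
    return None, tried

def explore_with_localization(stages, faulty):
    """Each failed path reveals one faulty link, which is avoided afterwards."""
    known_bad = set()
    tried = 0
    while True:
        path = []
        for stage in stages:
            usable = [link for link in stage if link not in known_bad]
            if not usable:
                return None, tried  # no fault-free path exists
            path.append(usable[0])
        tried += 1
        bad = [link for link in path if link in faulty]
        if not bad:
            return tuple(path), tried
        known_bad.add(bad[0])  # hypothetical fault localization verdict

# Hypothetical topology: 4 stages, each with a faulty link "a" and a benign link "b".
stages = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
faulty = {"a1", "a2", "a3", "a4"}
```

In this toy topology, end-to-end monitoring tries all 2^4 = 16 paths before reaching the only fault-free one, whereas the localization-aided source discards one faulty link per failed attempt and succeeds on its fifth attempt, after four failed paths.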


Network  diagnosis  and  performance  measurement.  Network  diagnosis  and  performance 
measurement play an important role in ensuring normal network operations and performing informed
traffic engineering. However, current practice and research studies in network diagnosis
and  performance  measurement  largely  rely  on  ad  hoc  monitoring  and  probing,  and  assume  no 
presence  of  malicious  routers  in  the  network  [48,  49,  99,  47].  Secure  fault  localization  provides 
information  about  link  quality  which  cannot  be  biased  by  malicious  routers  and  is  thus  verifiable 
to  others  even  in  the  presence  of  adversaries. 

Figure 1.2: The network layer integrating data-plane fault localization for fast failure recovery
(linear path exploration). [Diagram labels: Routing plane; Forwarding plane.]

1.2  Challenges  and  Insights 

In  addition  to  providing  security  against  strong  adversaries,  a  fault  localization  scheme  must  also  be 
practical; in particular, it must possess all of the following properties: (i) low detection delay (i.e.,
the  time  required  to  accurately  localize  a  faulty  link),  (ii)  low  computational  overhead,  (iii)  low 
communication  overhead,  and  (iv)  low  storage  overhead.  Failing  to  achieve  any  of  the  above  four 
properties  may  render  the  protocol  impractical.  For  example,  a  fault  localization  protocol  with  high 
communication  and/or  storage  overhead  will  perform  poorly  even  when  the  network  data  plane  is 
not  under  attack;  this  will  be  unacceptable  in  most  settings,  especially  in  resource-constrained 
networks  such  as  sensor  networks.  Similarly,  a  fault  localization  protocol  with  a  long  detection 
delay  will  enforce  only  a  poor  bound  on  an  adversary’s  ability  to  degrade  end-to-end  throughput 
before  being  identified.  This  may  result  in  a  significant  monetary  loss  to  a  service  provider  and, 
worse,  in  cases  where  routing  paths  change  periodically,  the  attacker  may  escape  unscathed. 

Until  now,  the  design  of  fault  localization  protocols  has  proven  to  be  surprisingly  difficult  when 
confronting  security,  efficiency,  and  agility  challenges  in  the  presence  of  strong  adversaries. 



•  Security  and  efficiency:  Sophisticated  attacks  such  as  framing  and  collusion  attacks  and 
natural  packet  loss  tend  to  break  fault  localization  protocols  (e.g.,  Fatih  [71],  ODSBR  [19], 
Watchers [25], Audit [14], Network Confessional [15], etc.) or lead to heavy-weight protocols
(to  prevent  sophisticated  attacks). 

•  Agility: In addition, current secure and relatively light-weight protocols leverage coarse-grained
flow fingerprinting along end-to-end paths to prevent packet modification attacks
while reducing communication overhead. However, in addition to having high storage overhead,
these techniques result in long detection delays and require monitored paths to be
long-lived (e.g., Statistical FL by Barak et al. [21] localizes a fault only after monitoring 10^8
packets over the same path), which is impractical for networks with short-lived flows and agile routing paths.

Our  key  insight  is  that  we  can  achieve  a  high  packet  forwarding  guarantee  via  fault  localization 
by  limiting  the  number  of  malicious  packet  drops/forgeries  at  the  data  plane,  instead  of  perfectly 
detecting  every  single  malicious  activity  which  tends  to  result  in  high  overhead.  Therefore,  strong 
per-packet  monitoring  or  authentication  to  achieve  perfect  detection  of  every  single  dropped  or 
forged  packet  is  unnecessary  for  limiting  the  adversary’s  influence.  Instead,  the  fault  localization 
protocol  can  employ  probabilistic  approaches  to  yield  statistical  guarantees,  e.g.,  via  probabilistic 
packet  monitoring  using  packet  sampling  or  probabilistic  packet  authentication  using  a  set  of  short 
packet-dependent  random  integrity  bits.  In  this  way,  each  dropped  or  forged  packet  has  a  non-trivial 
probability  of  detection.  Hence,  if  a  malicious  node  drops  or  forges  more  than  a  threshold  number  of 
packets,  the  malicious  activity  will  cause  a  detectable  deviation  in  the  state  maintained  at  different 
routers.  Essentially,  this  methodology  traps  an  attacker  in  a  dilemma:  if  the  attacker  inflicts 
damage  worse  than  a  threshold,  it  will  be  detected,  which  may  lead  to  removal  from  the  network;  if 
the attacker inflicts damage under a threshold, the damage is limited and thus a guarantee on data-plane
packet delivery is achieved. To measure the effectiveness of confining data-plane attackers
with fault localization, we propose a new metric called guaranteed forwarding correctness,
which is the lower bound on the fraction of packets correctly forwarded along an end-to-end
path, even in the presence of adversaries.
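
The dilemma can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that each dropped or forged packet is detected independently with the same probability p_detect; the function names and the independence assumption are not taken from the protocols in this dissertation.

```python
import math

def detection_probability(num_bad_packets: int, p_detect: float) -> float:
    """Probability that at least one of the bad packets is detected."""
    return 1.0 - (1.0 - p_detect) ** num_bad_packets

def packets_until_detection(p_detect: float, confidence: float) -> int:
    """Smallest number of bad packets that is detected with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_detect))
```

Under these assumptions, even a per-packet detection probability of 1% suffices: an attacker who corrupts a few hundred packets is caught with probability above 99%, while an attacker who stays below that count inflicts only bounded damage.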

1.3 Dissertation Overview


Based  on  the  philosophy  of  limiting  the  adversarial  activities,  we  propose  four  protocols  in  this 
dissertation:  PAAI,  ShortMAC,  TrueNet,  and  DynaFL,  for  secure  and  practical  data-plane  fault 
localization.  PAAI,  ShortMAC,  and  DynaFL  are  probabilistic  protocols.  More  specifically,  PAAI 
utilizes  a  secure  packet  probabilistic  sampling  technique,  ShortMAC  features  a  probabilistic  packet 
authentication mechanism, and DynaFL employs a probabilistic packet fingerprinting data structure.
In contrast, TrueNet is a deterministic protocol leveraging trusted computing technologies
with special hardware support (such as TPM chips). From another perspective, both PAAI and
ShortMAC are path-based, where the fault localization procedure is executed at the granularity
of  end-to-end  paths  and  the  source  node  of  a  path  needs  to  directly  interact  with  each  router  in 
that  path.  In  contrast,  TrueNet  and  DynaFL  are  1-hop-based,  as  the  operations  required  by  fault 
localization  are  only  performed  between  1-hop  neighbors.  Figure  1.3  summarizes  the  characteristics 
of  the  four  protocols. 


Figure 1.3: Summary of the proposed protocols in this dissertation. [Diagram labels: Thesis: data-plane
fault localization; Techniques; Protocols.]


These  four  protocols  explore  different  approaches  and  directions  in  the  design  space  of  fault 
localization,  and  achieve  various  tradeoffs  between  storage  overhead,  communication  overhead, 
computational  overhead,  and  deployability.  We  summarize  the  tradeoffs  of  the  proposed  protocols 
using these performance metrics in Table 1.1 to provide some intuition (we will provide a more
detailed  comparison  including  the  effectiveness  of  fault  localization  in  Chapter  9).  Our  results  in 
this  dissertation  show  that  by  limiting  the  adversarial  activities  with  probabilistic  algorithms  or 
emerging  hardware  technologies,  secure  fault  localization  can  be  achieved  with  a  lower  bound  on 
the  forwarding  performance  and  fundamentally  higher  efficiency  than  previously  known  protocols 
for  fault  localization.  Our  proposed  fault  localization  protocols  also  address  the  security  threats 
that  defy  most  prior  work. 


Protocol   Storage              Communication   Computation       Deployability
PAAI       per-path state       » 3%            per-packet PRF    loose time sync
ShortMAC   per-path state       < 0.1%          per-packet MAC    change packet header
TrueNet    per-neighbor state   < 0.1%          per-packet MAC    change packet header, require TPMs
DynaFL     per-neighbor state   < 0.1%          per-packet hash   loose time sync

Table 1.1: Metrics and tradeoffs.


The  remainder  of  this  dissertation  includes  the  following  chapters.  Chapter  2  formally  defines 
the  problem  and  states  the  assumptions.  Chapter  3  sketches  the  challenges  in  achieving  secure 
fault  localization  by  presenting  several  strawman  approaches  and  their  security  vulnerabilities. 

Chapters  4  and  5  present  the  two  path-based  protocols,  PAAI  and  ShortMAC,  respectively. 
Both  protocols  (and  path-based  fault  localization  protocols  in  general)  require  a  source  node  S  to 
solicit  acknowledgments  from  intermediate  routers  in  the  forwarding  path  for  the  packets  S  has  sent. 
PAAI  explores  schemes  where  a  single  acknowledgment  returned  by  a  router  only  acknowledges  a 
single  packet  that  S  has  sent,  and  focuses  on  studying  how  to  employ  secure  sampling  to  reduce 
protocol  overhead:  whether  and  how  to  sample  a  subset  of  packets  to  acknowledge,  or  whether  and 
how  to  sample  a  subset  of  routers  to  send  the  acknowledgments.  In  contrast,  ShortMAC  studies  a 
different  approach,  using  a  single  acknowledgment  to  acknowledge  a  set  of  aggregated  packets  for 
higher  efficiency. 

To  overcome  certain  limitations  of  path-based  fault  localization  protocols  (e.g.,  poor  support 
for  dynamic  routing  paths),  Chapters  6  and  7  present  two  1-hop-based  protocols,  TrueNet  and 
DynaFL,  respectively.  TrueNet  assumes  the  deployment  of  trusted  computing  components  in  the 
network,  and  thus  achieves  secure  fault  localization  with  high  efficiency  unachievable  in  traditional 
networks. Due to the inherent limitation of current trusted computing technologies, TrueNet is
most  effective  against  software-based  (as  opposed  to  hardware-based)  data-plane  attacks,  which  we 
argue  is  the  major  form  of  large-scale  data-plane  attacks.  DynaFL  implements  secure  1-hop-based 
fault  localization  without  relying  on  trusted  computing,  and  thus  is  resilient  against  hardware-based 
attacks  as  well.  DynaFL  aims  to  localize  forwarding  faults  to  a  specific  1-hop  neighborhood  instead 
of  a  specific  link,  trading  precision  for  practicality  of  fault  localization. 

Finally,  Chapter  8  summarizes  the  related  work  and  Chapter  9  concludes  the  dissertation. 


Chapter  2 


Thesis,  Problem  Statements,  Metrics, 
and  Assumptions 

2.1  General  Thesis 

This  dissertation  aims  to  achieve  secure  and  efficient  data-plane  fault  localization  and  explore  the 
tradeoffs  in  this  design  space.  More  specifically,  given  a  set  of  adversarial  nodes  in  the  network,  we 
are  interested  in  the  design  of  protocols  that  monitor  the  forwarding  behavior  of  intermediate  nodes 
for  packet  dropping,  modification,  injection,  and  delaying  activities  over  a  period  of  time  and  then 
securely  localize  the  presence  of  the  adversary  on  a  particular  link  (or  a  set  of  links).  Note  that 
the literature has shown that such protocols can only identify links adjacent to malicious nodes,
rather  than  identifying  the  nodes  [21].  Our  thesis  statement  is  as  follows: 

Thesis  statement.  Instead  of  aiming  to  detect  any  single  forwarding  failure,  we  explore  if  fault 
localization  can  be  utilized  to  limit  the  damage  an  adversary  can  inflict  at  the  data  plane  and  in 
turn  produce  a  provable  lower  bound  on  the  forwarding  correctness.  We  also  attempt  to  see  if  the 
philosophy  of  limiting  the  adversarial  activities  can  enable  the  use  of  probabilistic  algorithms  and 
emerging  hardware  virtualization  technologies,  which  may  give  rise  to  negligible  protocol  overhead 
without  sacrificing  security. 

To support this thesis statement, we take the following steps in this dissertation. We classify
the fault localization protocols into path-based and 1-hop-based. In a path-based approach, the
fault localization process operates on individual end-to-end paths, where the source node of a
path  requires  provable  “receipts”  (or  acknowledgments)  from  the  destination  and  the  intermediate 
nodes  on  the  forwarding  path  for  the  packets  that  the  source  node  has  sent.  We  study  probabilistic 
algorithms  exploring  different  design  dimensions  to  reduce  the  protocol  overhead: 

•  intermediate  nodes  only  send  packet  receipts  for  a  probabilistically  selected  subset  of  packets 
(PAAI); 

•  only a probabilistically selected subset of intermediate nodes send packet receipts to the source (PAAI);

•  instead  of  acknowledging  a  single  packet,  a  packet  receipt  can  acknowledge  a  set  of  packets 
aggregated  in  a  probabilistic  and  efficient  way  (ShortMAC). 
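
Probabilistic selection of packets can be sketched with a keyed pseudo-random function. The example below is a minimal illustration and not the PAAI construction itself: it assumes HMAC-SHA256 as a stand-in PRF, and the names prf and is_sampled as well as the 64-bit truncation are hypothetical choices. Because the decision depends on a secret key, a router without the key cannot predict which packets will be selected.

```python
import hashlib
import hmac

def prf(key: bytes, data: bytes) -> int:
    """Stand-in keyed PRF: first 64 bits of HMAC-SHA256 as an integer."""
    digest = hmac.new(key, data, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")

def is_sampled(key: bytes, packet: bytes, sampling_rate: float) -> bool:
    """Select a packet when its PRF value falls below rate * 2^64."""
    return prf(key, packet) < int(sampling_rate * 2**64)
```

Both endpoints holding the same key reach the same decision for the same packet, so no extra signaling is needed to agree on the sampled subset.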

In a 1-hop-based approach, the fault localization process runs between 1-hop neighbors, i.e.,
each  node  only  monitors  its  1-hop  neighbors  to  detect  any  forwarding  fault.  As  we  show  later, 
compared  to  path-based  approaches,  1-hop-based  fault  localization  can  better  cope  with  dynamic 
routing  paths  and  traffic  patterns,  but  tends  to  localize  data-plane  faults  to  a  1-hop  neighborhood 
instead of a specific link (DynaFL). We also show that, with the aid of trusted computing, 1-hop-based
fault localization can localize data-plane faults to a specific link with much lower overhead
(TrueNet). 

2.2  Scope  and  Assumptions 


Scope.  Since  we  focus  on  data-plane  security  at  the  network  layer,  we  assume  the  following 
network  control-plane  and  link-layer  mechanisms,  each  of  which  represents  a  separate  line  of  research 
orthogonal  to  ours. 

•  We  can  borrow  existing  secure  routing  protocols  [50,  42,  73,  96]  by  which  nodes  can  learn  the 
genuine  network  topology  and  the  source  can  know  the  outgoing  path. 

•  We assume secure neighbor identification so that a node upon receiving a packet knows which
neighbor  sent  that  packet,  which  can  be  achieved  via  link-layer  authentication. 

•  In addition, when needed, a source node S can set up a shared secret key K_si with router
f_i using a well-studied key exchange protocol, e.g., Diffie-Hellman as in Passport [60]. This
symmetric-key setup happens very infrequently, thus representing only a one-time cost.
Barak  et  al.  [21]  prove  that  such  a  shared  secret  is  necessary  for  any  secure  fault  localization 
protocol  via  path  monitoring. 

We  focus  on  achieving  secure  fault  localization  against  malicious  routers.  We  do  not  consider 
control-plane  or  routing  attacks  and  endhost-  or  source-based  attacks  such  as  DoS,  while  TrueNet 
complements existing secure routing [26, 27, 37] or DoS prevention schemes [92].

Cryptography  assumptions.  For  the  sake  of  efficiency,  we  avoid  using  per-packet  asymmetric 
cryptography  due  to  its  high  per-packet  computation  and  communication  overhead.  We  assume 
that  the  nodes  can  perform  symmetric  key  operations  as  well  as  compute  a  collision-resistant  hash 
function  and  a  keyed  pseudo-random  function  PRF. 

Network  model.  We  consider  a  general  multi-hop  network  model  where  routers  relay  packets 
between sources and destinations, such as the ISP, enterprise, and datacenter networks. We assume
that the links in the network independently exhibit some natural packet loss due to congestion
and/or transmission errors. Throughout the dissertation, we follow the notation as illustrated in Figure 2.1.
We denote the routers in a path by f_1, f_2, ..., f_{d-1}, the destination by f_d, and the link
between f_{i-1} and f_i by l_i. We call nodes closer to the destination downstream nodes, and nodes
that are further away from the destination upstream nodes.

Basic notation. We denote the round-trip time from a node f_i to f_d as r_i. Let E_K(·) denote
encryption using symmetric key K. Further, let MAC_K(m) denote a message m authenticated
by  key  K  using  a  message  authentication  code  (MAC).  For  simplicity,  in  our  description,  we  do 
not  differentiate  between  the  keys  for  encryption  and  MAC  computation;  although  in  practice,  one 
would  derive  separate  keys  for  encryption  and  MAC  computation. 
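
Deriving separate keys from one shared secret can be sketched as follows. This is a minimal illustration under stated assumptions: HMAC-SHA256 serves both as the key-derivation step and as the MAC, and the labels b"enc" and b"mac" and all function names are hypothetical, not part of the protocols described here.

```python
import hashlib
import hmac

def derive_keys(shared_key: bytes) -> tuple[bytes, bytes]:
    """Derive independent encryption and MAC keys from one shared secret."""
    enc_key = hmac.new(shared_key, b"enc", hashlib.sha256).digest()
    mac_key = hmac.new(shared_key, b"mac", hashlib.sha256).digest()
    return enc_key, mac_key

def mac(key: bytes, message: bytes) -> bytes:
    """Compute MAC_K(m) with HMAC-SHA256 as the assumed MAC."""
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check a tag in constant time."""
    return hmac.compare_digest(mac(key, message), tag)
```

Using distinct labels yields two unrelated keys, so a weakness in how one key is used does not directly expose the other.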

Figure 2.1: An example path and notation. [Diagram: source S, links l_1, ..., l_d, routers f_1, ..., f_{d-1},
and destination f_d; upstream points toward S and downstream toward f_d.]

2.3  Attacker  Model 

We  assume  an  adversary  in  complete  control  of  an  arbitrary  number  of  intermediate  nodes  on  a  path, 
including  knowledge  of  their  secret  keys.  The  adversary  can  eavesdrop  and  perform  traffic  analysis 
anywhere  on  the  path.  The  adversary  may  drop,  inject  or  alter  packets  on  the  links  that  are  under 
its  control.  We  allow  the  protocol  parameters  to  be  public;  consequently,  the  adversary  may  try  to 
bias  the  measurement  results  in  order  to  evade  detection  or  incriminate  honest  links.  However,  the 
adversary  cannot  control  the  natural  packet  loss  rate  on  the  links  in  the  path,  because  this  would 
constitute  a  physical-layer  attack  which  can  be  dealt  with  through  physical-layer  protections. 

The  goal  of  an  adversary  who  controls  malicious  routers  is  to  sabotage  data  delivery  at  the 
forwarding  path.  Instead  of  considering  an  individual  forwarding  attack,  we  seek  a  general  way  of 
defining  malicious  forwarding  behavior.  We  identify  packet  dropping  and  packet  injection  as  the 
two  fundamental  data-plane  threats,  while  other  data-plane  attacks  can  be  reduced  to  these  two 
threats  as  follows:  (i)  packet  modification  is  equivalent  to  dropping  the  original  packet  and  injecting 
a  fabricated  packet,  (ii)  packet  replay  can  be  regarded  as  packet  injection,  (iii)  packet  delay  can 
be  treated  as  dropping  the  original  packet  and  later  injecting  it,  and  (iv)  packet  misrouting  can 
be  regarded  as  dropping  packets  along  the  original  path  and  injecting  them  on  the  new  path.  A 
formal  definition  follows: 
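
The reduction to drops and injections can be illustrated by comparing the multiset of packets sent with the multiset received. The sketch below is an assumption-laden toy: it treats packets as opaque byte strings, ignores timing (so packet delay is not captured), and the function name is hypothetical.

```python
from collections import Counter

def drops_and_injections(sent: list[bytes], received: list[bytes]) -> tuple[int, int]:
    """Count dropped and injected packets between two observation points."""
    sent_count, received_count = Counter(sent), Counter(received)
    dropped = sent_count - received_count    # sent but never received
    injected = received_count - sent_count   # received but never sent
    return sum(dropped.values()), sum(injected.values())
```

A modified packet shows up as one drop plus one injection, and a replayed packet as one injection, which matches the reductions (i) and (ii) above.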

Definition 1. An (x, y)-Malicious Router is a router that intentionally drops up to a fraction
x of the legitimate data packets from a source S to a destination f_d, and injects up to y spurious
packets to f_d, pretending that the packets originate from S. The misbehavior space of such a malicious
router comprises (i) dropping packets, (ii) injecting packets on any of its adjacent links which
we  call  malicious  links  (non-malicious  links  are  called  benign  links),  (iii)  strategically  claiming 
arbitrary local state (e.g., number of packets received) to its own advantage, or (iv) colluding with
other  malicious  routers  to  perform  the  above  attacks. 

Such  a  strong  attacker  model  is  not  merely  born  out  of  academic  curiosity,  but  has  been  widely 
witnessed in practice. For example, outsider attackers have leveraged social engineering, phishing [7],
exploitation of router software vulnerabilities [2, 13], and compromising weak passwords [41]
to  compromise  ISP  and  enterprise  routers  [87].  Also,  in  a  2010  worldwide  security  survey  [1],  61% 
of  network  operators  ranked  infrastructure  outages  due  to  misconfigured  routers,  which  also  fall 
under  our  attacker  model,  as  the  No.  2  security  threat. 

Finally,  we  note  that  the  protocols  proposed  in  this  dissertation  heavily  depend  on  several 
cryptographic  primitives,  such  as  the  Message  Authentication  Code  (MAC)  and  Pseudo-Random 
Function  (PRF).  Though  different  protocols  may  utilize  different  implementations  or  instantiations 
of these cryptographic primitives, we assume the MAC resists existential forgery under chosen-message
attacks, and the PRF with a randomly chosen key provides outputs that look unpredictable
and cannot be distinguished from a truly random function (except with a negligible probability).
In other words, we assume the adversary cannot break these cryptographic primitives
(though different implementations of them may be resilient against different numbers of adversarial
queries, assuming their high-level security properties suffices for this dissertation).

2.4  Problem  Formulation 

This  dissertation  focuses  on  providing  data-plane  fault  localization  for  a  lower-bound  guarantee  on 
data-plane  packet  delivery.  In  this  section,  we  define  detection  thresholds,  faulty  links,  and  finally 
we  formalize  fault  localization. 

We  introduce  the  detection  thresholds  to  limit  the  adversarial  activities  at  network  data  plane: 

Definition 2. Given a drop detection threshold T_dr (i.e., fraction of dropped packets) and an
injection detection threshold T_in (i.e., number of injected packets), a link l_i is defined as faulty
iff: (i) more than T_dr fraction of packets are dropped on l_i by f_i, or (ii) more than T_in packets are
injected by f_i, or (iii) the adjacent router f_i or f_{i+1} makes l_i appear faulty over a period of time.

When T_dr and T_in are carefully set based on the prior knowledge such that the natural packet
loss and corruption are below T_dr and T_in, respectively, a faulty link must be a malicious link.
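
The two threshold tests in the definition of a faulty link can be written as a simple predicate. This is only a sketch: the function and parameter names are hypothetical, the counts are assumed to be already known to the evaluator, and the third condition (a router making a link appear faulty) is not modeled.

```python
def link_is_faulty(sent: int, dropped: int, injected: int,
                   t_dr: float, t_in: int) -> bool:
    """Apply the drop-fraction and injection-count thresholds to one link."""
    drop_fraction = dropped / sent if sent > 0 else 0.0
    return drop_fraction > t_dr or injected > t_in
```

For example, with a drop threshold of 1% and an injection threshold of 10 packets, a link that loses 5 of 1000 packets to natural loss is not flagged, while a link that loses 20 of 1000 is.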

Definition 3. (N, δ)-Data-Plane Fault Localization is achieved iff: given an end-to-end communication
path p, after a detection delay of sending N packets, the source node S of path p can
identify a specific faulty link along that path (if any) with false positive or negative rate less than δ.

Definition 4. (Ω, θ)-Guaranteed Forwarding Correctness (Guaranteed Data-Plane Packet
Delivery) is achieved iff: after exploring at most Ω paths, a source can find a non-faulty path
(if any) along which all routers have correctly forwarded at least θ fraction of the source’s data
packets sent along the path to f_d.

To achieve a guaranteed $\theta$, we need to bound (not necessarily eliminate) the adversary's ability
to drop packets and to inject packets so that if the adversary drops more than $\alpha$ percent of packets
or injects $\beta$ bogus packets, it will be detected with high probability. A formal definition follows.

Definition 5. For an epoch with a sufficiently large number of data packets sent by a source, a fault
localization protocol achieves $(\alpha, \beta)_\delta$-Forwarding Security iff two conditions are simultaneously
satisfied:

1. (Low False Negative Rate) When the adversary drops more than $\alpha$ percent of the data packets
on a single link, or injects more than $\beta$ fake packets on a single link, the source will detect at
least one of the malicious links under the adversary's control with probability at least $1 - \delta$;

2. (Low False Positive Rate) The probability of falsely incriminating at least one benign link is
at most $\delta$.

2.5  Metrics 

The forwarding correctness (Definition 4) and forwarding security (Definition 5) provide a systematic
way to quantify the effectiveness and security of fault localization. We also identify three key
metrics to evaluate the efficiency or practicality of such protocols:

• detection rate, i.e., the number of data packet transmissions required to detect a malicious
link (with the false positive and negative rates below a certain threshold),

•  communication  overhead,  i.e.,  the  additional  packets  (and  their  size)  that  are  sent  per  data 
packet  from  the  source,  and 

•  storage  overhead,  i.e.,  the  amount  of  temporary  storage  that  must  be  maintained  at  each 
intermediate node per unit time.

Security Challenges


A common approach to achieve data-plane fault localization is for the source node to require
acknowledgment packets (ACKs) from the destination and the intermediate nodes in the forwarding
path. In a realistic setting, a forwarding link may incur some benign packet loss due to congestion
or channel errors. At the same time, an adversary who potentially controls multiple intermediate
nodes may try to bias the identification procedure by selectively dropping, modifying, or injecting
packets in order to evade detection or incriminate honest nodes. Consequently, a secure fault
localization protocol must be simultaneously robust to both benign packet loss and malicious behavior.
In other words, it must exhibit low false positive (falsely identifying a legitimate link as malicious)
and false negative (falsely leaving a malicious link undetected) rates. This chapter presents several
common security pitfalls or vulnerabilities in prior fault localization schemes to illustrate the
challenges in achieving security against strong adversaries.

3.1  Challenge  1:  Acknowledgment-based  Approach 

Let us consider that the source $S$ in Figure 2.1 sends out a data packet $m$ towards the destination
$f_d$. Upon receiving $m$, each router $f_i$ in the path must return an acknowledgment (ACK)
to $S$ authenticated with the secret key shared with $S$ (assuming $S$ and $f_i$ have pre-established a
secret key using Diffie-Hellman [32] as in Passport [60], or some other key exchange protocol). If
$S$ receives correct ACKs from routers $f_1, \ldots, f_{i-1}$ but not from router $f_i$, $S$ concludes that link
$l_{i-1}$ is faulty. In this approach, however, a malicious router $f_i$ can drop the ACK from another remote
router, say $f_{i+5}$, without dropping other packets, to frame $l_{i+4}$ as malicious to the source (framing
attack).

To reduce the overhead of ACK packets, the source node may “sample” a subset of packets,
and only the sampled packets will require ACKs from the routers. In this approach, however, if a
malicious router $f_m$ can distinguish between sampled and non-sampled packets, $f_m$ can safely drop
all and only non-sampled packets without being detected.

Chapter 4 presents more sophisticated attacks against such acknowledgment-based fault
localization protocols.

3.2  Challenge  2:  Sophisticated  Packet  Modification  Attacks 

In Fatih [71], WATCHERS [25, 43], and Audit [14], each router records a traffic summary based
on counters or Bloom Filters [24], which are updated with no secret keys for the packets the router
forwards. The routers periodically exchange local summaries with others for fault detection based
on flow conservation. Without any authentication of the data packets, these schemes suffer from
packet modification attacks. For example, in Audit [14], each router simply counts the number of
packets it received for a certain path, and periodically sends the counter to the source node of the
path for packet loss detection. However, malicious packet modification cannot be detected based
solely on packet counts. Even when Bloom Filters are used [71] to reflect the packet contents, a
malicious router can still tactically modify packets without affecting the Bloom Filter image (since
Bloom Filters may not be collision-resistant).
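
The lack of collision resistance can be made concrete with a toy example. The sketch below is not taken from any of the cited schemes: it uses a deliberately tiny, unkeyed Bloom filter (64 bits, 2 hash functions, both hypothetical parameters) and shows that an adversary who can evaluate the public hash functions can search offline for a substitute payload that sets exactly the same bits as the original, so that the summary is unchanged.

```python
import hashlib

M_BITS = 64    # deliberately tiny so that collisions are quick to find
K_HASHES = 2


def positions(packet: bytes) -> frozenset:
    """Bit positions set by a packet; unkeyed, so an adversary can compute them."""
    return frozenset(
        int.from_bytes(hashlib.sha256(bytes([k]) + packet).digest()[:4], "big") % M_BITS
        for k in range(K_HASHES)
    )


def summarize(packets) -> int:
    """Bloom filter image (as an integer bitmask) of a sequence of packets."""
    bits = 0
    for packet in packets:
        for pos in positions(packet):
            bits |= 1 << pos
    return bits


def forge(original: bytes, max_tries: int = 200_000):
    """Search for a different payload that sets exactly the same bits."""
    target = positions(original)
    for n in range(max_tries):
        candidate = b"forged-" + str(n).encode()
        if candidate != original and positions(candidate) == target:
            return candidate
    return None
```

Swapping the original packet for the payload returned by `forge` leaves `summarize` unchanged for any traffic mix. Realistic filters are far larger, so the search is costlier, but without a secret key nothing prevents the adversary from performing it offline.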

Chapter 7 describes more challenges in dealing with packet modification attacks in fault
localization protocols relying on flow conservation.


3.3 Challenge 3: Colluding Attacks

Routers in a path may employ “hop-by-hop” monitoring to detect packet delivery faults, which reduces
the communication overhead of sending the traffic summaries back to the source. For example, in
Figure 2.1, each router $f_i$ asks for the traffic summaries (e.g., acknowledgments) only from its
2-hop neighbor $f_{i+2}$ in the path, and accuses $l_i$ if $f_i$ does not receive the correct traffic summaries.
In this approach, however, if $f_i$ is colluding with $f_{i+1}$ and does not accuse $f_{i+1}$ even if $f_i$ does not
receive the correct traffic summaries from $f_{i+2}$, then $f_{i+1}$ can safely drop packets without being
detected. Watchdog [66], Catch [65], and the proposal due to Liu et al. [58] are vulnerable to similar
colluding attacks.

In this chapter, we present our first path-based fault localization protocol, PAAI (including PAAI-1
and  PAAI-2),  that  is  robust  against  strong  adversaries  with  a  practical  tradeoff  between  detection 
delay,  communication  overhead,  and  storage  overhead.  As  a  path-based  protocol,  in  PAAI,  the 
source  node  requires  packet  receipts,  or  acknowledgments  (ACKs),  for  the  packets  the  source  has 
sent. 

Instead of having each router send an ACK for each received packet, PAAI employs secure
sampling to reduce the communication overhead incurred by the ACKs. We systematically explore the
design space of utilizing secure sampling for path-based fault localization protocols. We investigate
a set of basic protocols, each exemplifying a design dimension, and examine the underlying
tradeoffs. In particular, PAAI-1 and PAAI-2 sample along different dimensions: PAAI-1 investigates how
to sample a subset of packets to be acknowledged by all the routers in the forwarding path, and
PAAI-2 investigates how to sample a subset of routers to send the ACKs for each packet. We also
show the possibility of constructing hybrid protocols based on PAAI-1 and PAAI-2.

To clearly demonstrate the tradeoff between the two sampling approaches, in PAAI a single
ACK sent by a router acknowledges only one packet sent by the source, although the protocols
may be extended to enable one ACK to acknowledge a set of packets, as in ShortMAC
(Chapter 5). For ease of understanding, the PAAI protocol described in this chapter
focuses only on detecting packet dropping and modification attacks. For PAAI-1 and PAAI-2,
we present both upper and lower performance bounds via theoretical analysis, and average-case
results via simulations. We conclude that the proposed PAAI-1 protocol outperforms other related
schemes.

4.1  Introduction 

We observe that designing any path-based fault localization protocol using ACKs involves making
two fundamental decisions:

1.  which  data  packets  to  acknowledge;  and 

2.  which  intermediate  nodes  should  send  the  ACKs. 

With  this  in  mind,  we  explore  different  design  choices  along  the  two  aforementioned  aspects  and 
investigate  the  tradeoff  using  the  performance  metrics.  More  specifically,  we  study  the  following 
approaches: 

1.  A  strawman  approach:  Every  intermediate  node  sends  an  ACK  for  every  lost  or  modified 
data  packet. 

2.  The  Probabilistic  ACK-based  Adversary  Identification  (PAAI)  approaches:  either  (i)  only  a 
subset  of  data  packets  must  be  acknowledged  (PAAI-1),  or  (ii)  only  a  subset  of  intermediate 
nodes  must  respond  to  an  ACK  request  (PAAI-2). 

The full-ACK scheme achieves the lowest detection delay by determining the responsible link for every
single packet transmission failure. However, gathering such fine-grained information introduces
high communication overhead. In contrast, PAAI-1 and PAAI-2 employ probabilistic sampling to
gather only coarse-grained information, differing from each other in that they perform probabilistic
sampling in different dimensions. In both PAAI schemes, we aim to achieve a low detection delay
while retaining practicality for most networks.

The PAAI-1 protocol is fairly intuitive, simple and flexible, yet achieves more desirable properties
than the full-ACK scheme, PAAI-2, and other related work. Finally, we also discuss the viability
of constructing protocols that exemplify hybrids of the basic design primitives (Section 4.11).

Contribution. To the best of our knowledge, this is the first attempt to design a secure fault
localization protocol that obtains a practical tradeoff between the detection rate and the
communication and storage overhead for realistic network settings. It is also the first systematic study of the
design space for fault localization protocols (Section 4.3, Section 4.4 and Section 4.5). We propose
a set of basic fault localization protocols, one for each design dimension, where the PAAI-1
protocol (Section 4.5.1) is distinctly more practical than the others. We obtain theoretical bounds for
the performance of our protocols (Section 4.8), and also launch simulations to derive average-case
results and validate our theoretical results (Section 4.9).

4.2  Setting 

Besides  the  problem  formulation  described  in  Chapter  2,  we  introduce  additional  assumptions  and 
conventions  for  this  chapter  below. 

ACK structure. For any data packet $m$ sent out by $S$, let the hash of $m$, denoted by $H[m]$, be a
packet identifier for $m$. For any $m$, we define the corresponding ACK from $f_i$ to have the structure
$a_i = (H[m] \| A_i^m)$, where $A_i^m$ is a report computed by $f_i$. The report $A_i^m$ will be a function of $f_i$'s
local report $R_i$ and its downstream neighbor $f_{i+1}$'s ACK (if present). Specific details may vary in
each protocol description.

Onion reports. We recall the well-known notion of an onion report. When each intermediate
node $f_i$ must return a local report $R_i$ to $S$ in an authenticated manner, we have inductively,
for $i \in [1, d-1]$, $A_i = \mathrm{MAC}_{K_{si}}(i \| R_i \| A_{i+1})$, while $A_d = \mathrm{MAC}_{K_{sd}}(d \| R_d)$.
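
The inductive construction can be sketched as follows. This is only an illustration under stated assumptions: HMAC-SHA256 stands in for the MAC, node indices are encoded in two bytes, and each layer carries the inner layer alongside its tag so that the source can later peel the onion. The `Layer` type and the helper names are hypothetical, and a real encoding would need unambiguous field lengths.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Layer:
    node: int                  # index i of the reporting node f_i
    report: bytes              # local report R_i
    tag: bytes                 # A_i, a MAC over (i || R_i || A_{i+1})
    inner: Optional["Layer"]   # layer received from f_{i+1}; None at the origin


def mac(key: bytes, node: int, report: bytes, inner_tag: bytes) -> bytes:
    """HMAC-SHA256 over i || R_i || A_{i+1}, with an empty A_{i+1} at the origin."""
    msg = node.to_bytes(2, "big") + report + inner_tag
    return hmac.new(key, msg, hashlib.sha256).digest()


def wrap(key: bytes, node: int, report: bytes, inner: Optional[Layer] = None) -> Layer:
    """One onion step at f_i: bind the local report to whatever arrived from downstream."""
    inner_tag = inner.tag if inner is not None else b""
    return Layer(node, report, mac(key, node, report, inner_tag), inner)


def build_onion(keys: dict, reports: dict) -> Layer:
    """Build the outermost layer A_1 by wrapping from f_d back towards f_1."""
    layer = None
    for i in sorted(keys, reverse=True):
        layer = wrap(keys[i], i, reports[i], layer)
    return layer
```

Because each tag covers the tag below it, a node cannot alter or remove a downstream report without invalidating a layer that only the downstream node's upstream neighbor could have produced.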

Assumptions. We assume the presence of symmetric paths, where the forward path (for data)
and the reverse path (for acknowledgments) are identical; and we assume that a source node knows
its forwarding path to the destination. We assume that the nodes on any given path are loosely
time-synchronized.

Finally, given a path from a source to a destination, recall from Section 2.2 that the source can
establish a shared pairwise symmetric key $K_{si}$ with each intermediate node $f_i$ on the path to the
destination.

Throughout  this  chapter,  we  focus  on  the  localization  of  packet  dropping  and  modification 
attacks,  and  use  the  general  term  packet  corruption  to  denote  packet  dropping  and  modification 
activities. 


4.3 A Strawman Approach: Full-ACK

We observe that designing any path-based fault localization protocol involves making two
fundamental decisions: (i) which data packets to acknowledge, and (ii) which intermediate nodes should
send the acknowledgments. As a first step towards a systematic exploration of the protocol design
space for fault localization protocols along the two aforementioned aspects, we discuss the simple
and fairly intuitive full-ACK protocol (similar to the Optimistic Per-Packet FL Protocol from Barak
et al. [21]), where every intermediate node on the forwarding path must return an ACK for every
corrupted data packet sent by the source. A corrupted packet refers to one that fails to reach the
destination intact (either dropped or modified). Below, we give a brief description of the protocol
and discuss its security and performance. A theoretical analysis and simulation results for the
full-ACK protocol are given later in Section 4.8 and Section 4.9, respectively.


Protocol. Let us consider that $S$ sends out a data packet $m$ towards the destination $f_d$. On
receiving $m$, $f_d$ must return an ACK, $a_d$, authenticated with the secret key shared with $S$, i.e.,
$a_d = \mathrm{MAC}_{K_{sd}}(H[m])$. If no valid ACK is received from $f_d$ within a pre-specified wait-time, $S$ will
send out an onion report request. The onion report is computed by the intermediate nodes in the
manner explained earlier, wherein a local report $R_i$ is set to be $(i \| H[m] \| a_d)$. Upon receiving the
ACK containing the onion report from $f_1$, $S$ can sequentially verify each report embedded in it.
For some $i < d$, if the MAC from each intermediate node $f_j$, $j \in [1, i]$, is valid but the MAC from
$f_{i+1}$ is invalid or not present in the final ACK, then $S$ identifies link $l_i$ as faulty and adds one to its
corruption score. Over a period of time, if the corruption score of a particular link exceeds a fixed
threshold determined from the natural packet loss rate, then that link is identified as malicious.

Security. In the above protocol, if a malicious node corrupts a packet (data or ACK), one of its
adjacent links has its corruption score increased. This follows largely from the security of onion
reports. Since PAAI-1 employs similar techniques, we defer more details to Section 4.4. The
adversarial nodes on the path may collude to share packet corruption activities among themselves;
however, in this case, the corruption rate will still be bounded (proportional to the number of
malicious nodes in the path).

Performance. For each corrupted packet, the full-ACK scheme can determine precisely the location
of the packet corruption; thus it is able to directly compute the corruption rate of each link on
a given path and identify malicious links within a small number of packet corruptions. However, this
high detection rate is achieved at the price of a large amount of communication overhead at each
node. Specifically, the full-ACK scheme imposes an overhead of at least one packet of $O(1)$ size per
data packet sent out by $S$, and an additional overhead of one packet of $O(d)$ size (the onion report)
in case packet corruption occurs. The storage overhead is high in the worst case but lower on
average due to the low detection delay. More details are given later in Section 4.8 and Section 4.9.

The high overhead makes the full-ACK protocol unaffordable for most networks; therefore, fault
localization protocols with a better tradeoff among the three performance metrics are desirable.

4.4  Overview  of  PAAI 

In contrast to the full-ACK protocol, where the ACK mechanism was completely deterministic, we
now investigate probabilistic ACK-based adversary identification (PAAI) approaches with the
underlying motive of reducing the protocol overhead at the expense of slightly worsening the detection
delay. Loosely speaking, we investigate two contrasting approaches: one where only a subset of
data packets must be acknowledged, and another where only a subset of intermediate nodes must
send the acknowledgments¹. In particular, we construct (i) the PAAI-1 protocol: every intermediate
node sends an ACK for only a selected fraction of data packets; and (ii) the PAAI-2 protocol:
only one selected intermediate node sends an ACK for each data packet. At first glance, the two
approaches may seem to be only minor variations of the full-ACK mechanism; however, we stress
that there are several challenges involved in ensuring security of these approaches. We now briefly
outline these approaches along with the challenges involved.

¹ It is natural to imagine the possibility of composing these approaches. We discuss this in Section 4.11.

In the first approach (PAAI-1), $S$ monitors the path for only a fraction of the total traffic. More
specifically, for a given data packet, $S$ solicits an ACK from every intermediate node only with
some probability $p$. Now, since a fraction of traffic is unmonitored, the protocol must ensure that a
malicious node $f_z$ is not able to determine from the content of a data packet $m$ whether $S$ solicits
an ACK for $m$. Otherwise, on receiving $m$, if $f_z$ determines that $m$ need not be acknowledged,
then it could safely corrupt $m$ without increasing its probability of being identified.

In PAAI-2, $S$ monitors the path for every data packet, with the provision that $S$ solicits an
ACK for a corrupted data packet from only one selected node on the path. However, the protocol
must ensure that a malicious node $f_z$ cannot decipher the identity of the selected node $f_e$ from the
content of a data packet $m$. Otherwise, on receiving $m$, if $f_z$ determines that $f_e \le f_z$ (i.e., that
$f_e$ is upstream of or equal to $f_z$), then it could safely corrupt $m$ without increasing its probability
of being identified.

In order to circumvent the above attacks and still perform probabilistic monitoring, we make
use of a delayed sampling mechanism. Specifically, in both PAAI protocols, $S$ sends out an ACK
request (henceforth referred to as a probe) at a later time for a data packet sent earlier. In PAAI-1,
the probe conveys the information that the corresponding data packet must be acknowledged
(otherwise no probe is sent). In PAAI-2, the probe content determines which intermediate node
is selected. However, in either protocol, a malicious node may withhold a data packet $m$ until the
arrival of the corresponding probe in an attempt to decide whether to corrupt $m$. To circumvent
this, we require loose time-synchronization among the nodes in the network such that the clock
error between two adjacent nodes $f_i$ and $f_{i+1}$ is less than $\min(r_0)$, i.e., the minimum value of the
round trip time from $S$ to the destination. In this scenario, an intermediate node would discard a
data packet if it carries an expired timestamp.
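
The freshness check performed by an intermediate node might look like the sketch below. The text does not specify how long a timestamp remains valid, so `max_age` and `clock_error` are hypothetical parameters; the only property illustrated is that a packet withheld for too long arrives with a stale timestamp and is discarded by the next honest node.

```python
def accept_data_packet(timestamp: float, now: float,
                       max_age: float, clock_error: float) -> bool:
    """Return True iff the embedded timestamp is still within the freshness window.

    timestamp   -- sending time embedded in the packet by the source
    now         -- local clock reading of the receiving node
    max_age     -- assumed maximum in-flight time before a packet counts as expired
    clock_error -- assumed bound on clock skew between loosely synchronized nodes
    """
    age = now - timestamp
    return -clock_error <= age <= max_age + clock_error
```

A node holding a packet until the probe arrives would need the freshness window to be at least as long as the delay between data packet and probe, which is what the bound on the clock error is meant to rule out.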

Both PAAI protocols employ a scoring mechanism in order to identify malicious links over a
period of time. We set a threshold for the end-to-end corruption rate of data packets for a given
path. The threshold value is chosen based on the natural packet loss rate, such that the natural
end-to-end loss rate will not exceed the threshold value. At the end of each probe, $S$ computes
the end-to-end corruption rate so far, based on the number of sent data packets and successfully
received ACKs from the destination; if the corruption rate exceeds the threshold value, then it
indicates that an adversary is present on the path. Using the history of scores (i.e., the scores
accumulated so far) of the links, $S$ will identify the adversarial presence on a link (or a set of
links) whose score exceeds a per-link score threshold within a bounded number of probes. On the
other hand, the score of an honest link will not exceed the per-link score threshold. Note that this
mechanism is in sharp contrast to the on-demand secure routing approach [19], where the probing
is launched only when the end-to-end drop rate exceeds a certain threshold; consequently there is no
history of scores which can be used, thus allowing an adversary to freely corrupt packets until the
end-to-end corruption rate reaches the threshold and then cause arbitrary links to be incriminated
due to natural packet loss when probing is initiated.

We now give some details on the specific scoring mechanism employed by each PAAI protocol.
Loosely speaking, in PAAI-1, if an intermediate node fails to return an ACK for a probed data
packet, then $S$ will increase the corruption score of its upstream link. However, note that if each
intermediate node were to send a separate ACK, then a malicious node could selectively drop the
ACKs from legitimate nodes in order to incriminate honest links. To circumvent this, PAAI-1
employs onion reports, similar to the full-ACK protocol.

PAAI-2, on the other hand, utilizes a slightly different scoring mechanism. For a given data
packet, if the selected node $f_e$ fails to return an ACK, then $S$ infers that there exists at least one
malicious link upstream of $f_e$; consequently $S$ will increase the corruption score of each link between
$f_e$ and itself. Now, suppose that a malicious packet corruption occurred at a link $l_{i-1}$. Then, let $X$
be the event that the intermediate node $f_i$ is selected. We ensure that event $X$ occurs with a fixed
probability. Due to the above scoring mechanism, each occurrence of $X$ will create a difference in
the scores of the links on either side of $f_i$. Over a period of time, a difference in the score of two
adjacent links would indicate a potential malicious link. In order to ensure that event $X$ occurs
with a fixed probability, PAAI-2 selects an intermediate node uniformly at random for any data
packet. The protocol must also ensure that the identity of the selected node for any data packet is
not revealed at any point in time; otherwise, a malicious node could selectively drop ACKs from
legitimate nodes in order to incriminate honest links. Specifically, in order to incriminate an honest
link $l_h$, a malicious node could drop the ACK every time $f_{h+1}$ is selected, while behaving honestly
every time $f_h$ is selected. This would create a difference between the scores of $l_{h-1}$ and $l_h$. In order
to circumvent this, we design an oblivious selection and acknowledgment procedure, such that the
identity of the selected node is hidden from each node (except $S$) even through traffic analysis.
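
The effect of this scoring rule can be seen in a small idealized simulation. The sketch below is only a model of the rule as stated above: there is no natural loss, every packet is corrupted on one fixed link, and uniform selection is replaced by selecting each node equally often. The function names and the indexing convention (`scores[j]` is the score of $l_j$, the link between $f_j$ and $f_{j+1}$, with $f_0 = S$) are assumptions of the sketch.

```python
def update_scores(scores: list, selected: int, ack_received: bool) -> None:
    """If the selected node f_e returned no ACK, blame every link between S and f_e."""
    if not ack_received:
        for j in range(selected):  # links l_0 .. l_{e-1}
            scores[j] += 1


def simulate(d: int, faulty_link: int, rounds_per_node: int) -> list:
    """Every packet is corrupted on l_{faulty_link}; each node is selected equally often."""
    scores = [0] * d
    for _ in range(rounds_per_node):
        for e in range(1, d + 1):
            # f_e can acknowledge only if it sits upstream of the faulty link.
            update_scores(scores, e, ack_received=(e <= faulty_link))
    return scores
```

With `d = 4`, corruption on $l_1$ and 10 rounds per node, the scores come out as `[30, 30, 20, 10]`: the links upstream of the corruption share the same score, and the gap between $l_1$ and $l_2$ equals the number of times $f_2$ was selected.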

Finally, we remark that an adversary may choose to modify or drop any of the following:
(i) data packet, (ii) probe, or (iii) ACK. However, our protocol design ensures that the source node
$S$ interprets each such activity simply as a data packet drop. In what follows, we will simply use
the term drop to refer to any kind of packet modification or drop. Looking ahead, in Section 4.8, we
show that an adversary achieves the same total end-to-end corruption rate by employing different
individual corruption rates for different packet types.

4.5  The  PAAI  Protocols 

Formally,  the  two  PAAI  protocols  PAAI-1  and  PAAI-2  consist  of  five  stages:  (i)  send  data  and 
decide  whether  to  probe,  (ii)  probe,  (iii)  acknowledge,  (iv)  score,  and  (v)  identify.  We  give  the 
details  of  both  PAAI-1  and  PAAI-2  below. 

4.5.1  PAAI-1 

PAAI-1 employs probabilistic sampling in order to determine which data packets must be
acknowledged. For every sampled data packet, PAAI-1 requires each intermediate node and the destination
to return an onion report. The protocol details follow.

Stage  1:  send  data  and  decide  whether  to  probe 

Consider that $S$ sends out a data packet $m = (\text{data} \| \text{timestamp})$ towards the destination. On
receiving $m$, an intermediate node $f_i$ first checks whether the embedded timestamp is recent. If
verification fails, then $m$ is dropped. Otherwise, $f_i$ stores the identifier $H[m]$ for $m$ and starts a
wait timer $t_i = r_0/2$. Finally, $m$ is forwarded toward the destination.

$S$ then uses a secure sampling (SS) algorithm to determine whether it must send out a probe
for $m$. When given any input $m$, the SS algorithm must output “Yes” with a fixed probability $p$,
where $p$ is the probe frequency fixed at setup time. Such an algorithm can easily be constructed
by making use of a PRF keyed with a secret key known only to $S$. Note that such a mechanism is
necessary to prevent an adversary from correctly predicting whether or not a specific data packet
is sampled.
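
One possible way to realize such a sampling algorithm is sketched below, with HMAC-SHA256 standing in for the PRF. The function name, the use of the packet identifier as PRF input, and the 64-bit truncation are assumptions of this sketch rather than details given in the text.

```python
import hashlib
import hmac


def secure_sample(key: bytes, packet_id: bytes, p: float) -> bool:
    """Return True ("Yes") with probability about p, as a keyed function of the packet.

    key       -- secret known only to the source S
    packet_id -- identifier of the data packet, e.g. H[m]
    p         -- probe frequency, 0 <= p <= 1
    """
    digest = hmac.new(key, packet_id, hashlib.sha256).digest()
    value = int.from_bytes(digest[:8], "big")  # pseudorandom 64-bit integer
    return value < int(p * 2**64)
```

Since the decision depends on a key that only the source holds, a router that sees the packet cannot evaluate the function itself, and the decision for a given packet is reproducible at the source without storing per-packet state.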

If the SS algorithm outputs “No”, then the protocol is terminated for the current round.
Otherwise, $S$ executes the next stage of the protocol. In the following, it is implicit that a node $f_i$
accepts a packet (probe or ACK) iff it contains a data packet identifier already stored at $f_i$.

Stage  2:  probe 

$S$ sends out a probe $c = H[m]$ towards the destination. The probe contains the identifier $H[m]$
for the data packet $m$ sent earlier. On receiving a probe, an intermediate node $f_i$ starts a wait
timer $t_i = r_i$, forwards the probe towards the destination, and moves to the next stage. Note
that, in practice, the probe frequency $p$ will be set to a very low value. Therefore, if we use
unauthenticated probes, an adversary could potentially waste a lot of communication power of the
intermediate nodes by sending bogus probes. As a countermeasure, one could use authenticated
probe packets, where a chain of MACs (one for each intermediate node) is attached to each probe.

Stage  3:  acknowledge 

In this stage, the destination $f_d$ and intermediate nodes must return an onion report to $S$. Ideally,
the onion report must either originate at the destination, or at the upstream node of the link where
$m$ was dropped. To this end, we employ the following rules: (i) If no downstream ACK is received
within the wait time $t_i$, $f_i$ originates an onion report $A_i = \mathrm{MAC}_{K_{si}}(i \| H[m])$. (ii) Otherwise, on
receiving a downstream ACK within the wait-time, $f_i$ sets the local report $R_i$ to be $(i \| H[m])$
to create an onion report $A_i$ as explained earlier in Section 4.2. Finally, $f_i$ sends out an ACK
$a_i = (H[m] \| A_i)$ towards $S$.

Stage  4:  score 

Upon receiving the ACK containing the onion report from $f_1$, $S$ can sequentially verify each report
embedded in it. For some $i < d$, if the MAC from each intermediate node $f_j$, $j \in [1, i]$, is valid
but the MAC from $f_{i+1}$ is invalid or not present in the final ACK, then $S$ identifies link $l_i$ as faulty
and adds one to its corruption score. In the case where $S$ does not receive any report within a
wait-time, $S$ can simply conclude that a packet corruption occurred at its downstream link $l_0$.
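
The acknowledge and score stages can be sketched together as follows. This is a simplified model, not the exact wire format: HMAC-SHA256 stands in for the MAC, the ACK is represented as the list of tags from the outermost report inwards, each node's local report is its index together with the packet identifier, and all function names are hypothetical.

```python
import hashlib
import hmac


def mac(key: bytes, i: int, packet_id: bytes, inner_tag: bytes) -> bytes:
    """Tag of f_i over (i || H[m] || inner tag); inner_tag is empty at the origin."""
    return hmac.new(key, i.to_bytes(2, "big") + packet_id + inner_tag,
                    hashlib.sha256).digest()


def node_ack(i: int, key: bytes, packet_id: bytes, downstream=None) -> list:
    """Stage 3 at f_i: originate a report, or wrap the one received from downstream."""
    if downstream is None:                                       # rule (i): timer expired
        return [mac(key, i, packet_id, b"")]
    return [mac(key, i, packet_id, downstream[0])] + downstream  # rule (ii)


def localize(ack, keys: dict, packet_id: bytes, d: int):
    """Stage 4 at S: index i of the link l_i to blame, or None if all d tags verify."""
    if not ack:
        return 0                                  # no report at all: blame l_0
    for i in range(1, d + 1):
        if i > len(ack):
            return i - 1                          # report of f_i is missing
        inner = ack[i] if i < len(ack) else b""
        if not hmac.compare_digest(ack[i - 1], mac(keys[i], i, packet_id, inner)):
            return i - 1                          # report of f_i is invalid
    return None


def score(scores: list, link) -> None:
    """Add one to the corruption score of the blamed link, if any."""
    if link is not None:
        scores[link] += 1
```

In this model, a report that stops at $f_2$ on a three-hop path leads `localize` to return 2, i.e. link $l_2$, while overwriting the innermost tag after the fact invalidates the tag of $f_2$ that covered it and shifts the blame to $l_1$.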

Stage  5:  identify 

At any point in time, let $s_i$ be the corruption score of link $l_i$, and $n$ be the total number of probes
issued by $S$ so far. The average packet corruption rate $\rho_i$ for link $l_i$ so far can be computed as
$\rho_i = s_i / n$. We set a per-link corruption rate threshold (denoted by $T_{dr}$) according to the natural
loss rate $\rho_l$ ($T_{dr} > \rho_l$). Then if $\rho_i > T_{dr}$, $S$ convicts $l_i$ as a malicious link. More details are given in
Section 4.8.
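
A minimal sketch of the identify stage follows, assuming the average corruption rate of a link is its corruption score divided by the number of probes sent so far. The function names are hypothetical.

```python
def corruption_rates(scores: list, n_probes: int) -> list:
    """Per-link average corruption rate: score divided by number of probes."""
    return [s / n_probes if n_probes else 0.0 for s in scores]


def convicted_links(scores: list, n_probes: int, t_dr: float) -> list:
    """Indices i of the links l_i whose average corruption rate exceeds the threshold."""
    rates = corruption_rates(scores, n_probes)
    return [i for i, rate in enumerate(rates) if rate > t_dr]
```

For instance, with scores `[1, 30, 2]` after 1000 probes and a threshold of 0.02, only the link with index 1 is convicted.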

4.6 PAAI-2

Now we turn to the other design alternative: probabilistically sampling a subset of intermediate
nodes which must return an ACK. We propose PAAI-2, where only one intermediate node is selected
to return a report for every data packet. We remark that the strategy of selecting a subset of
intermediate nodes which must return an ACK tends to be vulnerable to selective dropping attacks
(see Section 4.4). Consequently, we find that PAAI-2 requires more algorithmic complexity and
incurs a higher detection delay than PAAI-1.

Stage  1:  send  data  and  decide  whether  to  probe 

Consider that $S$ sends out a data packet $m = (\text{data} \| \text{timestamp})$ towards $f_d$. On receiving $m$,
an intermediate node $f_i$ (including $f_d$) first checks whether the embedded timestamp is recent. If
verification fails, then $m$ is dropped. Otherwise, $f_i$ stores the identifier $H[m]$ for $m$ and starts a
wait timer $t_i = r_i$. Finally, $m$ is forwarded toward the destination.

On  receiving  m ,  fd  creates  a  report  Ai  =  MAC Kad{H[m])  and  returns  an  ACK  ad  =  (H[m\\\Ai) 
to  S.  On  receiving  an  ACK  from  fd  within  the  wait-time,  an  intermediate  node  fi  stores  a  copy  of 
it,  forwards  it  towards  S,  and  starts  a  waiting  time  ti  =  r$  —  ri . 

If $S$ receives a valid ACK from $f_d$ within the wait-time, it concludes that $m$ arrived unaltered at $f_d$ and the protocol terminates for the current round. Otherwise, $S$ executes the next stage of the protocol. In the following, it is implicit that a node $f_i$ accepts a packet (probe or ACK) iff it contains a data packet identifier already stored at $f_i$.

Stage  2:  probe 

$S$ sends out a probe $c = (H[m] \,\|\, Z)$ towards $f_d$. The probe contains an identifier $H[m]$ for $m$, and a random challenge $Z$.

On receiving a probe within the wait-time, an intermediate node $f_i$ computes a $\mathrm{PRF}_{K_{si}}(\cdot)$-based predicate $T_i$ over input $Z$, where $T_i$ returns “true” with probability $\frac{1}{d-i+1}$. If the wait-timer expires before a probe arrives, then the state maintained for $m$ is deleted. In what follows, we say that a node $f_i$ is sampled for a data packet $m$ if $T_i$ returns true on input $Z$.

Finally, $f_i$ starts a wait-timer $t_i = r_i$ and forwards the probe towards $f_d$.

Stage  3:  acknowledge 

In this stage, the intermediate nodes must return an ACK to $S$. Ideally, the ACK must originate at the upstream node of the link where $m$ was corrupted. To this end, we employ the following rules: (i) If an intermediate node $f_i$ does not receive any ACK from its downstream neighbor within the wait-time $t_i$, it generates an encrypted report $A_i = E_{K_{si}}(\mathrm{MAC}_{K_{si}}(i \,\|\, c \,\|\, a_d))$. If no ACK was received from $f_d$ in Stage 1, then $a_d$ is set to $\perp$. (ii) Otherwise, on receiving a downstream ACK within the wait-time, $f_i$ performs one of the following actions. If $f_i$ was sampled for $m$ during Stage 2, it generates an encrypted report $A_i$ (as described in the previous case) to overwrite the report in the received ACK. Otherwise it re-encrypts the report in the received ACK, i.e., $A_i = E_{K_{si}}(A_{i+1})$. The security reason for the re-encryption is given in Section 4.7. Finally, $f_i$ sends out an ACK $a_i = (H[m] \,\|\, A_i)$ towards $S$.

Definition 6. We say that a node $f_e$ is selected for a data packet $m$ if (i) $f_e$ is sampled for $m$, and (ii) $f_1, \ldots, f_{e-1}$ are not sampled.
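
To see why Definition 6 leads to a uniform choice, note that with sampling probability $\frac{1}{d-i+1}$ at $f_i$, the probability that $f_e$ is selected telescopes to $\frac{1}{d}$ for every $e$. The Python sketch below checks this exactly with rational arithmetic and shows one possible way to turn a keyed PRF into the predicate $T_i$. Using HMAC-SHA256 as the PRF and the names `is_sampled` and `selected_node` are assumptions made for illustration, not the construction used in this work.

```python
import hashlib
import hmac
from fractions import Fraction


def sample_probability(i, d):
    """Probability that f_i (1 <= i <= d) is sampled."""
    return Fraction(1, d - i + 1)


def selection_probability(e, d):
    """Probability that f_e is sampled and f_1..f_{e-1} are not."""
    prob = sample_probability(e, d)
    for j in range(1, e):
        prob *= 1 - sample_probability(j, d)
    return prob


def is_sampled(key, challenge, i, d):
    """PRF-based predicate T_i: true with probability about 1/(d-i+1)."""
    digest = hmac.new(key, challenge, hashlib.sha256).digest()
    value = int.from_bytes(digest[:8], "big")  # roughly uniform in [0, 2**64)
    return value * (d - i + 1) < 2 ** 64


def selected_node(keys, challenge):
    """Index e of the first sampled node; f_d is always sampled."""
    d = len(keys)
    for i in range(1, d + 1):
        if is_sampled(keys[i - 1], challenge, i, d):
            return i
```

For $d = 6$, `selection_probability(e, 6)` equals `Fraction(1, 6)` for every `e` from 1 to 6. Because $S$ shares each key with the corresponding node, it can evaluate the same predicates and therefore knows which node is selected for a given challenge.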

From the above definition, it follows that, for a given data packet, exactly one intermediate node is selected, uniformly at random with probability $\frac{1}{d}$. Observe that due to the ACK forwarding mechanism described above, $S$ expects an ACK that was generated at the selected node $f_e$ and re-encrypted by each upstream node between $f_e$ and $S$.

Stage  4:  score

In this stage, $S$ assigns numerical scores to the links. On receiving an ACK from $f_1$, $S$ first decodes the embedded report $A_1$ by performing successive decryption using the keys $K_{s1}, \ldots, K_{se}$ in that order, where $K_{se}$ is the secret key shared between $S$ and $f_e$. If the final decoded value matches the expected value ($\mathrm{MAC}_{K_{se}}(e \,\|\, c)$), then $S$ decides that there was no malicious activity in the interval $[l_0, l_{e-1}]$; consequently, no scores are updated. Otherwise, $S$ is convinced that there exists at least one malicious link in the interval $[l_0, l_{e-1}]$. Since each link in this interval has equal probability of being malicious, $S$ adds 1 to the individual score of each link in the interval. No scores are updated for the links in the interval $[l_e, l_{d-1}]$.
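
The scoring rule can be sketched in Python as follows. The sketch is illustrative only: the length-preserving XOR layer is a toy stand-in for the encryption $E_K$ (it is not a secure cipher), HMAC-SHA256 stands in for the MAC, the expected value is taken to be the MAC over $(e \,\|\, c)$, and all function names are hypothetical.

```python
import hashlib
import hmac


def mac(key, *parts):
    """HMAC-SHA256 over length-prefixed parts (illustrative MAC choice)."""
    h = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        h.update(len(part).to_bytes(4, "big") + part)
    return h.digest()


def xor_layer(key, data):
    """Toy stand-in for E_K: XOR with a key-derived stream (self-inverse)."""
    stream = hmac.new(key, b"layer", hashlib.sha256).digest()
    return bytes(a ^ b for a, b in zip(data, stream))


def honest_ack(keys, e, probe):
    """Report reaching S when f_e is selected and no node misbehaves."""
    report = xor_layer(keys[e - 1], mac(keys[e - 1], bytes([e]), probe))
    for i in range(e - 1, 0, -1):  # f_{e-1}, ..., f_1 re-encrypt
        report = xor_layer(keys[i - 1], report)
    return report


def score_ack(keys, e, probe, report, scores):
    """Update per-link scores in place; return True if the report verifies.

    report is None when no ACK arrived within the wait-time.
    """
    value = report
    if value is not None:
        for i in range(1, e + 1):  # decrypt with K_s1, ..., K_se
            value = xor_layer(keys[i - 1], value)
    expected = mac(keys[e - 1], bytes([e]), probe)
    if value is not None and hmac.compare_digest(value, expected):
        return True
    for link in range(e):  # blame every link in [l_0, l_{e-1}]
        scores[link] += 1
    return False
```

With $d = 6$ and selected node $f_3$, an honest report leaves all scores unchanged, whereas a missing or corrupted report raises the scores of $l_0$, $l_1$ and $l_2$ by one each and leaves $l_3$ to $l_5$ untouched.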

Stage  5:  identify 

$S$ pre-determines a per-link corruption rate threshold $T_{dr}$, based on which it further sets a threshold $\psi_{th}$ for the end-to-end corruption rate of data packets. $S$ constantly monitors the actual end-to-end data packet corruption rate $\psi$ based on the number of sent data packets and successfully received ACKs from the destination. It is guaranteed that $\psi_{th} < \psi$ if there is at least one link with a corruption rate exceeding $T_{dr}$. The source can then compute the per-link corruption rates based on the accumulated data and identify the link with the excessive corruption rate. More details are given in Section 4.8.

4.7  Security  Properties 

In order to prove our theoretical results in Section 4.8, we require the PAAI protocols to exhibit some key security properties. Below, we discuss four important security properties of the PAAI protocols.

Delayed Sampling. Recall that in PAAI-1, for a given data packet, $S$ solicits an ACK only with some probability $p$. A malicious node $f_m$ should not be able to decipher from the content of a data packet $m$ whether $S$ solicits an ACK for $m$. Otherwise, on receiving $m$, if $f_m$ determines that $m$ need not be acknowledged, it could safely corrupt $m$ without increasing its probability of being identified. Now recall from Stage 3 of PAAI-2 that for a given data packet, a sampled node must overwrite the ACK received from its downstream neighbor with a fresh ACK. Hence, on receiving a data packet $m$, if a malicious node could decipher from its content whether it is sampled for $m$, then it could safely corrupt $m$ without increasing its probability of being identified.

To prevent the above attacks, in both PAAI protocols, a probe $c$ is sent at a later time to request ACKs for a data packet sent at an earlier time. In PAAI-1, the probe conveys the information that the corresponding data packet must be acknowledged. In PAAI-2, the probe content determines whether an intermediate node is sampled. However, in both PAAI protocols, a malicious node may now try to wait for the arrival of the probe $c$ before forwarding $m$, in an attempt to decide whether to drop $m$. Therefore, we require loose time synchronization among the nodes in the network such that the clock error between two adjacent nodes $f_i$ and $f_{i+1}$ is less than $\min(r_0)$, i.e., the minimum value of the round trip time from $S$ to the destination. In this scenario, an intermediate node would discard a data packet that carries an expired timestamp.

Security against selective packet corruption. The onion report mechanism prevents any selective dropping attacks by an adversary in PAAI-1. Now recall that in PAAI-2, if an intermediate node does not receive any ACK within a wait-time, it generates a new ACK even if it is not sampled; otherwise an adversary could observe the ACK origin to infer whether an intermediate node is sampled. Further, the re-encrypt or overwrite technique in PAAI-2 ensures that a constant-size ACK is forwarded at each hop. If this were not the case, then an adversary who eavesdrops on all the links on the path could infer additional information about the origin of the ACK from differences in the ACK size at various links. Furthermore, in PAAI-2, for a given data packet, the probed node is selected uniformly at random; otherwise an adversarial node could preferentially perform data-plane attacks at nodes that are less likely to be sampled than others.

Adversary localization. In PAAI-1, for each sampled data packet (i.e., a data packet for which an onion report is requested) that was corrupted, $S$ can localize the packet corruption to a specific link by verifying the onion report. Now recall that in PAAI-2, for a given data packet, the ACK expected by $S$ is the one that is generated by the selected node. Therefore, if the selected node is located between $S$ and the adversary, then the adversary cannot influence the final ACK received at $S$. This implies that if no ACK or an invalid ACK is received at $S$, then there must exist at least one malicious link in the interval $[l_0, l_{e-1}]$.

4.8  Theoretical  Analysis 

In  this  section,  we  theoretically  analyze  the  guaranteed  end-to-end  forwarding  correctness,  detection 
delay,  communication  and  storage  overhead  of  the  proposed  protocols.  Proofs  of  the  theorems  and 
corollaries  are  given  in  the  appendix.  The  results  are  summarized  in  Table  4.1,  which  also  gives 
a  clear  comparison  between  the  full-ack,  PAAI,  and  statistical  FL  protocol  [21].  We  compare  our 
PAAI  protocols  mainly  with  the  statistical  FL  protocol  because  it  is  the  state-of-the-art  and  the 
only  protocol  with  a  rigorous  theoretical  analysis  to  the  best  of  our  knowledge.  In  Section  4.9,  we 
validate  our  theoretical  results  and  present  average-case  results  from  simulations. 

Definitions and notation. Let $\rho_i$ be the natural packet loss rate of link $l_i$, and suppose that the $\rho_i$ are i.i.d. random variables with maximum value $\rho$. Let $T_{dr}$ denote the per-link corruption rate threshold; and $\rho_i^*$ be the actual average corruption rate of link $l_i$, including both natural and malicious corruption. Let $\zeta$ be the malicious end-to-end corruption rate, i.e., the corruption rate due to malicious links. When the observed corruption rate value approaches its true value within a small uncertainty interval, the fault localization false positive/negative rate is limited below a certain threshold $\epsilon$. We call this the converged condition.

Let $p$ be the probe frequency employed in PAAI-1. Further, in PAAI-2, let $\psi_{th}$ be the threshold of the end-to-end data packet corruption rate. Let $n_i$ be the number of times that node $f_i$ is selected so far.

4.8.1  Bounding  Malicious  End-to-End  Corruption  Rate 

For ease of understanding, all the theoretical bounds in this subsection are computed under the converged condition. In Section 4.8.2 we derive the detection delay (the number of data packets the source must send to reach the converged condition) for the full-ACK and PAAI schemes. We can see that the detection rates are high in the full-ACK and PAAI-1 schemes, so the “unconverged” time period is negligible.

For simplicity, we first assume that an adversary employs an identical corruption rate for all types of packets (data, probe or ACK packets) at a controlled link $l_i$, and thus the probability that a packet of any kind is corrupted at $l_i$ is $\rho_i^*$. The following theorem proves the $(\Omega, \theta)$-guaranteed forwarding correctness (Definition 4) and $(\alpha, \beta)_\delta$-forwarding security (Definition 5) of full-ACK, PAAI-1, and PAAI-2. Since PAAI does not consider packet injection attacks, $\beta$ is inapplicable here. In addition, $\Omega$ equals the number of malicious links in the network, which is explained in Section 1.1. The following theorem also provides a general bound on the damage that an adversary with an arbitrary number of links under its control can inflict on the network’s end-to-end throughput.

Theorem 7. Forwarding Security and Correctness: Given a path of length $d$, the fractions ($\alpha$) of packets an adversary can drop on any link without being detected in full-ACK, PAAI-1, and PAAI-2 are: (i) $\alpha = T_{dr}$ in full-ACK and PAAI-1, and (ii) $\alpha = 1 -$ in PAAI-2 by setting the end-to-end corruption rate threshold $\psi_{th}$ as $\psi_{th} = 1 - (1 - T_{dr})^{2d}$, respectively. And the guaranteed forwarding correctness is $\theta = (1 - T_{dr})^d$, given the drop detection threshold $T_{dr}$.

In general, an adversary in control of $z$ intermediate links can cause (at most) the following malicious end-to-end corruption rates without being detected: (i) $\zeta = z T_{dr}$ in full-ACK and PAAI-1, and (ii) $\zeta = 1 -$ in PAAI-2 by setting the end-to-end corruption rate threshold $\psi_{th}$ as $\psi_{th} = 1 - (1 - T_{dr})^{2d}$.
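
For a feel of the magnitudes involved, the closed-form expressions of Theorem 7 for the end-to-end threshold, the guaranteed forwarding correctness, and the full-ACK/PAAI-1 corruption bound can be evaluated directly. The Python sketch below uses $T_{dr} = 0.03$ and $d = 6$, the example setting used later in this section; the function names are illustrative.

```python
def end_to_end_threshold(t_dr, d):
    """psi_th = 1 - (1 - T_dr)^(2d), the PAAI-2 end-to-end threshold."""
    return 1 - (1 - t_dr) ** (2 * d)


def forwarding_correctness(t_dr, d):
    """theta = (1 - T_dr)^d, the guaranteed forwarding correctness."""
    return (1 - t_dr) ** d


def max_undetected_corruption(z, t_dr):
    """zeta = z * T_dr for full-ACK and PAAI-1 with z malicious links."""
    return z * t_dr


theta = forwarding_correctness(0.03, 6)
psi_th = end_to_end_threshold(0.03, 6)
zeta = max_undetected_corruption(2, 0.03)
```

With these values the guaranteed forwarding correctness is about 0.83, the end-to-end threshold is about 0.31, and two malicious links can cause an undetected end-to-end corruption rate of at most 0.06 in full-ACK and PAAI-1.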

It  is  possible  that  an  adversary  may  choose  to  corrupt  different  types  of  packets  at  different 
rates.  However  we  can  intuitively  see  that  the  adversary  cannot  gain  any  advantage  by  doing  this, 
because  corrupting  any  type  of  packet  will  always  result  in  an  increase  in  the  corruption  score  of 
the  link  where  the  packet  was  corrupted. 

Corollary  8.  An  adversary  who  employs  different  corruption  rates  for  different  types  of  packets 
achieves  the  same  maximum  end-to-end  corruption  rate. 


Corollary 9 presents the optimal strategy that an adversary can employ in order to cause maximum degradation to the network throughput. The corresponding bounds on the degradation in network throughput under the optimal strategy are also presented.

Corollary 9. Given a fixed number of malicious links, the malicious end-to-end corruption rate $\zeta$ increases approximately linearly with the natural loss rate $\rho$. Given a fixed number $z$ of malicious links, the optimal strategy for the adversary to cause the maximum end-to-end corruption rate across all the paths containing malicious links in the network is to deploy one malicious link per path. In this case, the total malicious corruption rate across all paths containing compromised links increases linearly with $z$.

4.8.2  Detection  Delay 

We compute the detection delays and prove the $(N, \delta)$-data-plane fault localization (Definition 3) for the full-ACK scheme and the PAAI protocols in the following theorem.

Theorem 10. $(N, \delta)$-Data-Plane Fault Localization: Given the threshold $T_{dr} = \rho + \epsilon$ and the allowed false positive rate $\delta$, the full-ACK and the PAAI protocols require the following number of packets transmitted by the source to converge: (i) $N_1 = \frac{\ln(2/\delta)}{8\epsilon^2 (1-\rho)^{2+d}}$ for the full-ACK scheme, (ii) $N_2 = \frac{1}{p} \cdot \frac{\ln(2/\delta)}{8\epsilon^2 (1-\rho)^{2+d}}$ for PAAI-1, where $p$ is the probe frequency, and (iii) $N_3 = {} \cdot d \cdot \log(d)$ for PAAI-2.

Corollary  11  shows  the  sensitivity  of  the  detection  delay  (achieved  by  the  full-ACK  and  PAAI 
protocols)  to  the  various  protocol  parameters.  As  it  turns  out,  PAAI-1  can  achieve  shorter  detection 
delays  under  various  parameter  settings  (and  thus,  a  wide  range  of  empirical  scenarios). 

Corollary 11. For both the full-ACK scheme and PAAI-1, the allowed false positive rate $\delta$ is the dominating factor on their detection delays, while the network-related parameters (natural packet loss rate $\rho$ and path length $d$) have negligible influence on the detection delays. However, the detection delay of PAAI-2 heavily depends on the path length $d$.

For example, if we set $\delta = 0.03$ and $p = \frac{1}{d^2}$ and choose an arbitrary network setting where $T_{dr} = 0.03$, $\rho = 0.01$ and $d = 6$, then we have $N_1 = 1500$, $N_2 = 5 \times 10^4$ and $N_3 = 6 \times 10^5$; whereas the detection delay in the statistical FL protocol [21] is $2 \times 10^7$. Per Corollary 11, the detection delay for PAAI-1 does not vary much given other network-related parameter settings. Table 4.1 compares the detection delays achieved by the different protocols.

Protocol | Detection Delay | Communication | Storage (worst) | Storage (ideal)
Full-ACK | $\frac{\ln(2/\delta)}{8\epsilon^2 (1-\rho)^{2+d}}$ | $O(1 + \psi d)$ | $O(2 r_0 \nu)$ | $O(r_0 \nu)$
PAAI-1 | $\frac{1}{p} \cdot \frac{\ln(2/\delta)}{8\epsilon^2 (1-\rho)^{2+d}}$ | $O(pd)$ | $O(r_0 (0.5 + p) \nu)$ | $O(r_0 (0.5 + p) \nu)$
PAAI-2 | [illegible] $\cdot\, d \cdot \log(d)$ | $O(1)$ | $O(2 r_0 \nu)$ | $O(r_0 \nu)$
Statistical FL [21] | [illegible] | [illegible] | $O(p r_0 \nu)$ | $O(p r_0 \nu)$
Combination 1 | [illegible] $\cdot\, \frac{\ln(2/\delta)}{8\epsilon^2 (1-\rho)^{2+d}}$ | $O(p(1 + \psi d))$ | $O(r_0 (0.5 + 2p) \nu)$ | $O(r_0 (0.5 + 2p) \nu)$
Combination 2 | [illegible] | $O(p)$ | $O(r_0 (1 + p) \nu)$ | $O(r_0 \nu)$

Table 4.1: Detection rate and overhead comparison. The notation is given at the beginning of Section 4.8. We translate the related results [21] using our notation. Combination 1 and Combination 2 are described in Section 4.11.

4.8.3  Communication  Overhead 

In  this  section  we  compute  and  compare  the  communication  overhead  incurred  by  the  full-ACK  and 
the  PAAI  protocols  for  a  given  path  of  length  d.  The  analysis  results  are  presented  in  Table  4.1. 

Full-ack. Recall from Section 4.3 that in the benign case where no packet corruption occurs, each data packet requires one $O(1)$-sized ACK from the destination. When a packet corruption happens, the source solicits an $O(d)$-sized onion report via an $O(1)$-sized probe packet. Therefore, given the end-to-end corruption rate $\psi$, the overall communication overhead per packet is $O(1 + d\psi)$.

PAAI-1. Recall from Section 4.5.1 that for each sampled data packet, the source solicits one $O(d)$-sized onion report (in case of authenticated probes, the size of a probe packet is also $O(d)$). Since a given data packet is sampled only with probability $p$, the amortized communication overhead per data packet is $O(pd)$. By setting $p = \frac{1}{d^2}$ we can get $O(\frac{1}{d})$ overall communication overhead per packet. Note that the above results apply regardless of whether there are packet corruption activities or not.

PAAI-2. Recall from Section 4.6 that each intermediate node $f_i$ on the forwarding path either generates a new ACK or re-encrypts the ACK received from downstream. Therefore, an ACK packet traversing the path has a constant size ($O(1)$) at any point in time. Further, PAAI-2 requires one $O(1)$-sized probe packet per data packet sent by the source. Note that the above results apply regardless of whether there are packet corruption activities or not.
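
The three cost expressions can be compared numerically with a small Python sketch. It treats every $O(1)$-sized packet as one unit and an onion report as $d$ units, which drops all constant factors; the values of the end-to-end corruption rate and the probe frequency used below are arbitrary examples, and the function names are illustrative.

```python
def full_ack_overhead(d, psi):
    """One ACK per packet plus a d-unit onion report with probability psi."""
    return 1 + d * psi


def paai1_overhead(d, p):
    """A d-unit onion report for a fraction p of the data packets."""
    return p * d


def paai2_overhead():
    """One constant-size probe plus one constant-size ACK per packet."""
    return 2


d = 6
costs = {
    "full-ACK": full_ack_overhead(d, psi=0.05),
    "PAAI-1": paai1_overhead(d, p=1 / d ** 2),
    "PAAI-2": paai2_overhead(),
}
```

Under this unit-cost model, choosing a probe frequency that shrinks with the path length keeps the PAAI-1 cost well below one unit per data packet, while the full-ACK cost never drops below one unit.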

4.8.4  Storage  Overhead 

Storage is a major concern in certain resource-constrained networks. An adversary may even exploit the storage limitation and manipulate packet corruption activities to intentionally create the worst-case condition for the storage overhead of a fault localization protocol. On the other hand, in practical settings, including when the adversary has been identified (and bypassed), excessive packet corruption is infrequent (thus the worst cases do not arise frequently). A high storage overhead in such an ideal case is undesirable. Therefore, in this section we analyze and compare the storage overhead in both the worst and ideal cases for the full-ACK scheme and the PAAI protocols. In Section 4.9 we present the average-case storage overhead via simulations.

In the following, let $\nu$ be the number of data packets that $S$ sends out per unit time. Recall that $r_i$ denotes the round trip time between node $f_i$ and $f_d$. The results given below are summarized in Table 4.1.

Full-ack. In the worst case, on receiving a data packet $m$, an intermediate node $f_i$ needs to first wait $r_0$ time for a probe from the source, and then $r_i$ time for an ACK from $f_{i+1}$. Therefore $f_i$ stores at most $O(2 r_0 \nu)$ packets at a time. In the ideal case without packet drops, $f_i$ only needs to store a packet for $r_i$ time before receiving an ACK from $f_{i+1}$.

PAAI-1. If a data packet $m$ is not selected for a probe, $f_i$ needs to wait $\frac{r_0}{2}$ time for a probe packet from the source. If $m$ is selected for a probe, $f_i$ needs to further wait $r_i$ time for an ACK from $f_{i+1}$, in both the worst case and the ideal case. Therefore, given the probe frequency $p$, $f_i$ stores at most $(0.5 + p) r_0 \times \nu$ packets at a time in both the worst and ideal cases.

PAAI-2. In the worst case, on receiving $m$, $f_i$ waits $r_i$ time for an ACK from $f_{i+1}$, $r_0 - r_i$ time for a probe from the source, and $r_i$ time for an ACK from $f_{i+1}$ again, which gives the worst-case storage overhead $O(2 r_0 \nu)$. In the ideal case, $f_i$ only needs to wait $r_i$ time for the ACK from $f_{i+1}$. Therefore, in the ideal case the storage bound is $O(r_0 \times \nu)$ packets at a time.
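
The storage bounds above reduce to simple products of waiting time and sending rate. The Python sketch below evaluates them for assumed example values: an end-to-end round trip time of 60 ms, a sending rate of 100 data packets per second, and a probe frequency of 1/36. These inputs and the function name are illustrative choices, and the constants hidden by the O-notation are taken to be 1.

```python
def storage_bounds(r0, nu, p):
    """Return {protocol: (worst, ideal)} storage bounds in packets.

    r0 is the source-to-destination round trip time in seconds, nu the
    sending rate in packets per second, and p the PAAI-1 probe frequency.
    """
    return {
        "full-ACK": (2 * r0 * nu, r0 * nu),
        "PAAI-1": ((0.5 + p) * r0 * nu, (0.5 + p) * r0 * nu),
        "PAAI-2": (2 * r0 * nu, r0 * nu),
    }


bounds = storage_bounds(r0=0.06, nu=100, p=1 / 36)
```

For these inputs the worst-case bounds are 12 packets for full-ACK and PAAI-2 and roughly 3.2 packets for PAAI-1, while the ideal-case bounds for full-ACK and PAAI-2 drop to 6 packets.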

4.9  Simulation  Results  and  Analysis 

We  implement  a  simulator  to  study  the  average-case  performance  of  the  proposed  protocols,  and 
also  contrast  the  average-case  results  with  the  theoretical  results  (as  listed  in  Table  4.2).  Through 
simulations,  we  not  only  validate  our  theoretical  results  and  make  comparisons,  but  also  derive  new 
observations  missing  from  theoretical  analysis  by  itself. 

4.9.1  Methodology 

Adversary. Note that, in practice, an adversary usually directly compromises a node, corrupting the traffic flowing through that node at the adversary’s will. We emulate such a realistic scenario by setting malicious nodes in the path to perform malicious packet corruption activity. We simulate the adversary’s optimal strategy by deploying exactly one malicious node on the path (Corollary 9). Recall that, in our protocols, if a malicious node corrupts packets, it can manifest high corruption rates only on its adjacent links. We also set the adversary to employ the following tactics: (i) Since the full-ACK scheme and PAAI protocols ensure that the adversary cannot gain benefit by corrupting different packets at different rates (Corollary 8), the adversary corrupts all types of packets at the same rate. (ii) Without loss of generality, we assume that, when the malicious node receives but corrupts a data packet, on receiving an ACK request it will still send back the ACK as if it were functioning correctly. In this way, a malicious node $f_i$’s corruption activity always increases the corruption score of its downstream adjacent link $l_i$. Therefore $l_i$ is the target to identify.

Topology and Parameters. Recall the example topology given in Figure 2.1. We simulate the proposed protocols on one path with various lengths and varying locations of the malicious link. Due to lack of space, here we only present the results for an arbitrary setting where $d = 6$ and $f_4$ is set to be the node controlled by the adversary (results from other settings present similar trends and conclusions). According to our aforementioned adversarial setting, the malicious packet corruption will directly increase $l_4$’s corruption score; thus $l_4$ is the target link for our fault localization protocols to identify. In the following we also refer to $l_4$ as the malicious link $l_m$. We follow the example parameter settings used in our previous theoretical analysis, i.e., we set the benign per-link loss rate threshold $\rho = 0.01$ and the malicious per-link corruption rate $T_{dr} = 0.03$ (we implement this by setting a corruption rate of 0.02 for the malicious node $f_4$). However, recall from Corollary 11 and Table 4.1 that the performance of PAAI-1 does not degrade in the case of longer paths and higher natural loss rates. Each packet traversing a link (or the malicious node) has an independent probability of being corrupted bi-directionally below the corresponding corruption rate threshold of that link (or the malicious node). We also set the per-link bi-directional latency to be distributed uniformly at random between 0 and 5 ms.
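
A stripped-down Monte Carlo version of this setting can be sketched in Python. The sketch is a simplification under stated assumptions and is not the simulator used for the results below: it models only the forward direction, charges every loss to the link where it occurs (as an ideal localization would), and folds the malicious node's extra drops into its downstream link $l_4$. All names and the fixed seed are illustrative.

```python
import random


def simulate(num_packets, d=6, natural_loss=0.01, bad_node=4,
             extra_loss=0.02, seed=1):
    """Return the observed per-link corruption rates s_i / n.

    Each packet crosses links l_0..l_{d-1} in order and stops at the
    first link where it is lost. Link l_{bad_node} additionally drops
    packets with probability extra_loss.
    """
    rng = random.Random(seed)
    scores = [0] * d
    for _ in range(num_packets):
        for link in range(d):
            dropped = rng.random() < natural_loss
            if link == bad_node and rng.random() < extra_loss:
                dropped = True
            if dropped:
                scores[link] += 1
                break
    return [s / num_packets for s in scores]


rates = simulate(200_000)
suspect = max(range(len(rates)), key=lambda i: rates[i])
```

In this simplified model the benign links show observed rates of at most about 0.01, while the link downstream of the malicious node stands out at roughly three times that value, so `suspect` is 4.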

Evaluation  Metrics.  We  evaluate  (i)  fault  localization  false  positive  and  negative  rates  (which 
directly  relate  to  detection  delays)  and  (ii)  storage  overhead  of  each  node  for  the  full-ACK  and 
PAAI  protocols.  We  did  not  simulate  the  communication  overhead  because  the  theoretical  analysis 
already  gives  straightforward  and  tightly  bounded  results.  We  run  the  simulation  10000  times  for 
each  protocol  to  calculate  the  false  positive  and  false  negative  rates  and  plot  their  dynamics  over 
time.  Recall  from  Table  4.1  that  storage  overhead  directly  depends  on  packet  origination  rate;  as 
such  we  evaluate  it  for  different  orders  of  origination  rate:  1000  and  100  data  packets  per  second 
(the  storage  overhead  under  a  source’s  sending  rate  of  10  packets  per  second  is  too  low  to  exhibit 
any  insightful  traits). 

4.9.2  Results  and  Analysis 

As  presented  below,  we  are  able  to  both  validate  our  theoretical  results  and  to  derive  new  and 
interesting  observations  from  the  simulation  results. 

False positive and negative rates. Figure 4.1 plots the false positive and false negative rates observed from 10000 simulation runs for each protocol. From the figure we can observe that, given the same false positive threshold $\delta = 0.03$, the detection delays are nearly half of the corresponding theoretical bounds. We summarize the comparisons between theoretical and experimental results in Table 4.2. In addition, we can see that in PAAI-2, the source takes more time to accurately observe the per-link corruption rate for a link farther away from the source. This fact can be proved theoretically (we defer the proof to the full version).

Figure 4.1: False positive and negative rates. The time is measured by the number of packets sent by the source. (a) Full-ACK scheme, logarithmic scale for the y-axis. (b) PAAI-1, logarithmic scale for the y-axis. (c) PAAI-2, logarithmic scale for both axes.

Figure 4.2: Storage overhead. The storage is measured by the number of packets stored at any given time. (c) Storage traits with sending rate = 1000 pkt/sec in the full-ACK scheme.

Protocol | Detection Delay, bound (minutes) | Detection Delay, average (minutes) | Storage, bound (# pkt) | Storage, average (# pkt)
Full-ACK | 0.25 | 0.17 | 12 | 3.2
PAAI-1 | 9 | 4.2 | 3.2 | 3.0
PAAI-2 | 100 | 50 | 12 | 6.4
Statistical FL [21] | 3333 | N/A | < 1 | N/A

Table 4.2: Comparison of detection rates between theoretical results and simulation results. The source’s sending rate is set to 100 data packets per second. The storage overhead is the average number of packets stored in the presence of a malicious link $l_4$.
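
The detection delay bounds in minutes follow from the packet counts given in Section 4.8.2 and the sending rate of 100 data packets per second. A small Python check (the function and dictionary names are illustrative):

```python
def delay_minutes(packets, rate_pps=100):
    """Convert a detection delay in packets into minutes."""
    return packets / rate_pps / 60


# Packet counts quoted in the example of Section 4.8.2.
bound_packets = {
    "Full-ACK": 1500,
    "PAAI-1": 5e4,
    "PAAI-2": 6e5,
    "Statistical FL": 2e7,
}
bound_minutes = {name: delay_minutes(n) for name, n in bound_packets.items()}
```

This gives 0.25, 100 and about 3333 minutes for full-ACK, PAAI-2 and statistical FL, matching the bound column of Table 4.2. The rounded figure of $5 \times 10^4$ packets for PAAI-1 gives about 8.3 minutes, slightly below the 9 minutes listed in the table.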

Storage overhead. We launch two different sets of simulations to study the characteristics of storage overhead in fault localization protocols. In each scenario, if a fault localization protocol reaches the converged condition (after $10^3$, $2.5 \times 10^4$ and $3 \times 10^5$ data packets sent by the source in the full-ACK, PAAI-1 and PAAI-2 schemes, respectively), we assume the source bypasses the identified $l_4$ by replacing $f_4$ with an honest node $f'_4$ to connect nodes $f_3$ and $f_5$ (we implement this in the simulation by resetting $f_4$’s corruption rate to zero). We label cases where adversary identification comes into play as “w/ FL”. We also simulate the case where the existing adversary is not identified and bypassed, which is labeled as “w/o FL”.

We first investigate the storage overhead of a single node $f_1$ (which has the highest storage overhead, as we show later) under different sending rates of the source (1000 and 100 data packets per second). We let the source send 2000 data packets in total, within which only the full-ACK scheme can reach the converged condition. We therefore present the results for the full-ACK scheme in both the “w/ FL” and “w/o FL” cases to compare with the PAAI protocols. Figures 4.2(a) and 4.2(b) present $f_1$’s storage overhead when the source’s sending rate is 1000 and 100 data packets per second, respectively. The storage overhead clearly decreases at the lower sending rate. We further observe that, in the “w/o FL” case, PAAI-1 has the lowest storage overhead, and the storage overhead of each protocol increases roughly linearly with the source’s sending rate. This fact complies with our theoretical bounds (Table 4.1). In addition, it is clear that the full-ACK scheme achieves a lower storage overhead after bypassing the adversary (“w/ FL”). Therefore, though the full-ACK scheme presents the highest theoretical bound of worst-case storage overhead, it achieves the lowest storage overhead in practice when fault localization comes into play. This observation implies that, in essence, a protocol with a lower detection delay benefits more in the ideal cases where packet corruption activities are rare after the adversary is quickly bypassed.

In another simulation, we investigate the storage overhead of nodes at different locations in the path and the influence of fault localization on storage overhead. Since the full-ACK scheme has the lowest detection delay, we only present the simulation results of the full-ACK scheme due to space limitations (the results derived from other protocols present common trends). To make the influence of fault localization more graphically obvious, we enlarge the corruption rate of the malicious node $f_4$ to 0.1. In this simulation we let the source send 2000 data packets at the rate of 1000 data packets per second, and bypass the adversary after sending 1000 data packets. Figure 4.2(c) plots the resulting dynamics of the storage overhead of $f_1$ and two nodes farther along the path, from which we can observe that nodes closer to the destination have lower storage overhead and are less affected by adversarial packet corruption. This observation can be explained according to the theoretical analysis in Section 4.8.4.

4.10  Summary  of  Results 

From  the  theoretical  and  experimental  results,  we  can  make  the  following  major  observations: 

Theory  vs.  Simulation.  The  average-case  results  derived  from  our  simulations  are  within  the 
corresponding  theoretical  bounds.  For  the  detection  delay,  the  average-case  results  are  nearly  two 
times  better  than  the  corresponding  theoretical  results.  For  the  storage  overhead,  the  average- 
case  result  of  the  full-ACK  scheme  is  far  smaller  than  its  worst-case  bound,  thanks  to  its  fast 
convergence.  The  PAAI-1  protocol  also  presents  low  storage  overhead,  even  with  the  presence  of 
an  adversary. 

Practicality. We draw the following conclusions about the trade-off between the three performance
metrics achieved by the various protocols: (i) The full-ACK scheme offers the shortest
detection delay and incurs a low storage overhead, but at the cost of impractical communication
overhead. (ii) The PAAI-1 protocol offers a practical (though not the best) detection delay and
communication and storage overhead simultaneously. More specifically, given that each data packet
is  1.5KB  (which  is  the  currently  popular  MTU  standard),  per  Figures  4.2(a)  and  4.2(b),  PAAI-1 
introduces  less  than  45KB  additional  storage  overhead  even  at  its  peak  value  under  an  intense 
packet sending rate of 1.5MB per second, and around 6KB at its peak value under a packet sending
rate of 150KB per second. Furthermore, by setting the sampling rate p  =  ^p. PAAI-1 poses
only  around  3%  additional  communication  overhead  in  a  path  with  length  d  =  6,  per  Table  4.1; 
while  the  detection  delay  is  45  minutes  given  by  the  theoretical  bound,  and  around  20  minutes  on 
average per Table 4.2 (in previous analysis and simulation we set p  =  ^-). (iii) The PAAI-2 protocol
performs worse than the full-ACK scheme and the PAAI-1 protocol, but still
presents a more practical detection delay than the statistical FL scheme [21] (see below).
(iv) The statistical FL protocol [21] incurs almost optimal communication and storage overhead,
but achieves a rather impractical detection delay: nearly 50 hours in the worst case (Table 4.2).
We  conclude  that  PAAI-1  offers  the  most  desirable  trade-off  between  the  performance  metrics.  In 
contrast,  all  the  other  protocols  only  optimize  at  most  two  performance  metrics  at  the  cost  of 
deteriorating  the  other  metric(s)  undesirably. 


4.11  Combination 

So  far  we  have  explored  three  different  basic  approaches,  namely:  (i)  every  node  acknowledges 
every  corrupted  data  packet  (exemplified  by  the  full-ACK  scheme),  (ii)  every  node  acknowledges  a 
selected  fraction  of  data  packets  (instantiated  by  the  PAAI-1  protocol),  and  (iii)  a  selected  subset 
of  nodes  acknowledge  every  data  packet  (represented  by  the  PAAI-2  protocol).  Intuitively,  it  might 
be  tempting  to  consider  combinations  of  the  above  basic  approaches  in  order  to  improve  upon 
a  certain  performance  metric.  However,  as  we  demonstrate  below,  the  combinations  may  not 
necessarily  achieve  a  better  trade-off  between  the  performance  metrics  as  compared  to  the  basic 
approaches,  and  may  therefore  be  unfavorable  in  practice.  Specifically,  although  a  combination 
may  further  optimize  a  certain  performance  metric,  other  metrics  can  degrade  undesirably  at  the 
same time. Due to lack of space, we will briefly discuss two sample combinations and analyze the
corresponding  tradeoff. 


Combination 1. By combining the basic approaches (i) and (ii) above, we can design a protocol
where every node must acknowledge a selected fraction of corrupted data packets. The PAAI-1
protocol can be easily modified to follow this approach. Specifically, instead of using a
secret key known only to S to implement the probe function, we use the secret key $K_d$ shared
between S and $f_d$. Now, on receiving a data packet, $f_d$ can independently decide whether it must
be acknowledged. For a sampled data packet m, S sends out a probe only if it fails to receive an
ACK from $f_d$. The remaining details follow from PAAI-1. While retaining the same detection delay
as PAAI-1, the new protocol further reduces the communication overhead, since S now solicits an
onion report only for a corrupted sampled packet (instead of for every sampled packet, as in PAAI-1).
However, the storage overhead increases: in the worst case, on receiving m, each node must first
wait an additional $r_0$ time for an ACK from $f_d$, a waiting time that was not required in PAAI-1.
Its performance is summarized in Table 4.1.
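
The destination-side sampling decision described above can be sketched as follows. This is only an illustration under stated assumptions, not the PAAI-1 probe function itself: the keyed predicate is assumed to be built from HMAC-SHA256, and the names `is_sampled`, `source_must_probe`, `key_d`, and the float-valued sampling rate `p` are hypothetical.

```python
import hashlib
import hmac


def is_sampled(key_d: bytes, packet: bytes, p: float) -> bool:
    """Keyed sampling predicate (assumed construction, for illustration only).

    Both S and f_d hold key_d, so both reach the same decision for a packet,
    while a node without key_d cannot predict which packets are sampled.
    """
    digest = hmac.new(key_d, packet, hashlib.sha256).digest()
    # Map the first 8 bytes to an integer in [0, 2**64) and keep a fraction p.
    value = int.from_bytes(digest[:8], "big")
    return value < int(p * 2**64)


def source_must_probe(key_d: bytes, packet: bytes, p: float, ack_received: bool) -> bool:
    """S probes only for a sampled packet whose ACK from f_d never arrived."""
    return is_sampled(key_d, packet, p) and not ack_received
```

Because the decision depends only on the shared key and the packet, S and $f_d$ need no extra message to agree on which packets are sampled.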


Combination 2. By combining the basic approaches (ii) and (iii) above, we can design a protocol
where one selected node acknowledges a selected fraction of data packets. Similar to Combination
1, we use a probe function that is implemented using the secret key $K_d$. The data packet
structure is similar to that in PAAI-2. Now, on receiving a data packet, $f_d$ can independently
decide whether it must be acknowledged. If an intermediate node receives a valid ACK from
$f_d$, it immediately knows that the packet was sampled and that there will be no further probe.
For a sampled data packet, S sends out a probe only if it fails to receive an ACK from $f_d$.
The remaining details follow from PAAI-2. Intuitively, the new protocol incurs lower
communication overhead than both PAAI-1 and PAAI-2, but at the price of a longer detection
delay. Its performance is summarized in Table 4.1.

4.12 Summary

In  this  chapter,  we  address  the  problem  of  designing  a  secure  fault  localization  protocol  that  offers 
a  practical  trade-off  between  detection  delay,  communication  overhead,  and  storage  overhead.  To 
this  end,  we  systematically  explore  the  design  space  of  path-based  fault  localization  protocols  where 
an  ACK  packet  acknowledges  a  single  data  packet,  and  propose  a  set  of  basic  protocols  where  each 
protocol  exemplifies  a  design  dimension.  Based  on  our  theoretical  analysis  and  simulation  results, 
we  conclude  that  the  proposed  PAAI-1  protocol  achieves  the  best  trade-off,  and  as  a  result  is 
more  practical  than  the  other  protocols.  We  note,  however,  that  PAAI  bears  some  limitations  in 
its  extensibility  and  generality;  e.g.,  both  PAAI-1  and  PAAI-2  require  loose  time-synchronization, 
which,  although  a  viable  assumption  for  many  network  settings,  might  limit  their  applicability.  We 
address  these  limitations  in  the  next  chapter. 

Chapter 5


ShortMAC 


Existing fault localization protocols cannot achieve a practical tradeoff between security and efficiency.
For example, they incur unacceptably long detection delays and require monitored flows to
be impractically long-lived. Though PAAI improves the practicality of fault localization compared
to prior work, in both PAAI-1 and PAAI-2, an ACK packet sent by a router only acknowledges
a single corresponding packet. Intuitively, acknowledging a set of packets with one ACK packet
might further reduce the communication overhead, eliminate the need for packet sampling, and
eventually reduce the detection delay. In this chapter, we propose an efficient path-based fault
localization protocol called ShortMAC, in which routers locally cache fingerprints for a set of packets
they receive, and periodically send the fingerprints with a single ACK packet to the source. By
leveraging probabilistic packet authentication and an efficient fingerprinting data structure, ShortMAC
achieves 100 to 10000 times lower detection delay and overhead than related work.

5.1  Introduction 

In this chapter, we propose ShortMAC, an efficient fault localization protocol to provide a theoretically
proven guarantee on end-to-end data-plane packet delivery even in the presence of sophisticated
adversaries. More specifically, we aim to guarantee that, given a correct routing infrastructure,
a benign source node can quickly find a non-faulty path along which a very high fraction of
packets can be correctly delivered. Our key insights are two-fold:

Insight 1. We first observe that localizing data-plane faults along a communication path can be
reduced to monitoring packet count (number of received packets) and packet content (payload of
received packets) at each router on that path. Furthermore, if packets can be efficiently authenticated,
packet count also becomes a verifiable measure of packet content, because forged packets
(with invalid contents) will be dropped by the routers and manifest an observable deviation in the
packet count. Thus, routers can dramatically reduce storage overhead by storing counters instead
of packet contents.

Insight 2. We also observe that we can achieve a high packet delivery guarantee via fault localization
by limiting the amount of malicious packet drops/modifications, instead of perfectly
detecting each single malicious activity. Furthermore, strong per-packet authentication to achieve
perfect detection of every single bogus packet is unnecessary for limiting the adversary’s ability to
modify/inject bogus packets. Instead, the source can use much shorter packet-dependent random
integrity bits as a weak authenticator for each packet such that each forged packet has a non-trivial
probability of being detected. In this way, if a malicious node modifies or injects more than a threshold
number of (e.g., tens of) packets, the malicious activity will cause a detectable deviation in
the counter values maintained at different routers. Essentially, ShortMAC traps an attacker into a
dilemma: if the attacker inflicts damage worse than a threshold, it will be detected, which may lead
to removal from the network; otherwise, the damage is limited and thus a guarantee on data-plane
packet delivery is achieved.

Contributions. 1) We propose a data-plane fault localization protocol, ShortMAC, that achieves
high security assurance with 100 to 10000 times lower detection delay and storage overhead than
related work.

2)  We  derive  a  provable  lower  bound  on  successful  end-to-end  packet  forwarding  rate,  by  limiting 
adversarial  activities  instead  of  perfectly  detecting  every  single  malicious  action  which  would  incur 
high  protocol  overhead. 

3)  We  theoretically  derive  the  performance  bounds  of  ShortMAC  and  evaluate  ShortMAC  via 
SSFNet-based [6] simulation and Linux/Click router implementation. Our implementation and
evaluation results show that ShortMAC causes negligible throughput and latency costs while retaining
a high level of security.

5.2  ShortMAC  Overview 

ShortMAC  monitors  both  the  packet  count  and  content  at  each  hop.  Specifically,  a  router  maintains 
per-path  counters  to  record  the  number  of  received  data  packets  originated  from  the  source  in  the 
current  epoch.  To  ensure  that  the  packet  count  is  a  verifiable  measure  of  the  desired  monitoring 
task,  we  require  that  both  packet  modification  and  injection  by  malicious  (colluding)  routers  affect 
counter  values  at  benign  nodes. 

We  first  introduce  the  concept  of  an  epoch  to  facilitate  our  protocol  design  and  formal  analysis: 

Definition 12. An end-to-end communication is composed of a set of consecutive epochs. An epoch
for  an  end-to-end  path  is  defined  as  the  duration  of  transmitting  a  sequence  of  N  data  packets  by 
a  source  S  toward  a  destination  fd  along  that  path.  The  epochs  are  asynchronous  among  different 
paths. 

At the beginning of each epoch, denoted by $e_k$, a source node S selects a path p and starts
sending  packets  along  p,  with  each  packet  carrying  several  ShortMAC  authentication  bits.  The 
routers  verify  the  authentication  bits  in  each  received  packet  based  on  the  symmetric  key  shared 
with  the  source  node,  increment  locally  stored  counters  for  p  accordingly,  and  forward  only  the 
authentic packets. Due to the ShortMAC authentication bits, modified/injected packets result
in an observable deviation in the counter values, which enables fault localization by the source at
the end of each epoch.

At the end of each epoch $e_k$, the source S retrieves the counter reports from all routers and
the destination in p for $e_k$, via a secure channel as Section 5.3 will describe. S then performs fault
detection based on the retrieved counters, and bypasses the detected faulty link (if any) by finding
another path excluding the identified faulty link (e.g., via source routing, path splicing [72], pathlet
routing [35], or SCION routing [96]). The detection result is only used by S itself for selecting
its own routing paths; it is not shared with other nodes, since sharing would be susceptible to framing
attacks.

Although the high-level epoch-based protocol flow (nodes periodically send certain locally logged
traffic  summaries  to  the  source)  bears  great  similarity  with  Fatih  [71],  Audit  [14],  and  Statistical 
FL  with  sketch  [21],  both  Fatih  and  Audit  use  simple  counters  or  Bloom  Filters  without  keyed 
hash  functions  as  the  traffic  summaries,  thus  remaining  vulnerable  to  packet  modification/injection 
attacks.  In  addition,  the  sketch-based  packet  fingerprints  used  in  Statistical  FL  consume  several 
hundreds  of  bytes  for  each  path.  In  contrast,  ShortMAC  efficiently  tackles  packet  modification 
attacks  with  only  several-byte  counters  as  shown  below. 

5.2.1  ShortMAC  Packet  Authentication 

Our  approach  is  to  turn  packet  count  into  a  reliable  measure  of  packet  content  so  that  routers 
only  need  to  store  space-efficient  counters.  To  this  end,  the  integrity  of  the  source’s  data  packets 
must  be  ensured  in  order  to  detect  malicious  packet  modification  during  the  forwarding  path; 
otherwise,  a  malicious  router  can  always  perform  packet  modification  attacks  without  affecting  the 
counter  values,  or  inject  bogus  packets  on  behalf  of  the  source  to  manipulate  the  counter  values 
of  the  reporting  routers.  Hence,  we  reduce  the  problem  to  how  the  source  node  can  authenticate 
its  packets  to  all  the  routers  in  the  path.  However,  traditional  broadcast  authentication  schemes 
provide  high  authenticity  for  every  single  message,  which  is  neither  necessary  nor  practical  in  our 
setting  where  the  messages  are  line-rate  packets: 

1) Not practical: On one hand, perfectly ensuring the authenticity of every single data packet
introduces high overhead in a high-speed network. For example, digital signatures or one-time signatures
for per-packet authentication are either computationally expensive or bandwidth-exhaustive,
and using amortized signatures would either fail in the presence of packet loss or incur high communication
overhead [63]. Attaching a Message Authentication Code (MAC) for each node along the
path (as is used by Avramopoulos et al. [17]) is too bandwidth-expensive (e.g., reserving a 160-bit
MAC space for each hop). In addition, TESLA authentication [77] would require time synchronization
and routers to cache the received packets until the authentication key is later disclosed
(longer than the end-to-end path latency). Finally, some recently proposed multicast/broadcast
authentication schemes still require considerable communication overhead (e.g., up to hundreds of
bytes per packet [64]) or multiple rounds for authenticating a message [29].

2) Not necessary: On the other hand, as we aim to limit the damage the adversary can inflict
for  a  lower-bound  guarantee  on  data-plane  packet  delivery,  perfect  per-packet  authenticity  is  not 
necessary.  Instead,  our  goal  only  requires  the  authenticity  of  a  large  fraction  of  data  packets. 


ShortMAC approach. Based on these observations, we propose ShortMAC, a light-weight
scheme that reduces per-hop overhead at the cost of letting the adversary forge only a few (e.g., tens of)
packets. More specifically, in ShortMAC, the source attaches to each packet a k-bit random nonce,
called a k-bit MAC, for each node on the path, where the parameter k is significantly less than the
length of a typical MAC (e.g., k = 2). To construct the k-bit MAC for $f_i$, the source S uses a
Pseudo-Random Function (PRF) which constructs a k-bit string as a function of the packet m and the
key $K_{si}$ shared between S and $f_i$. We rely on the result that the output k-bit MAC is indistinguishable
from a random k-bit string to any observer without the secret key $K_{si}$ [67]. Each router
$f_i$ maintains two path-specific counters $C_i^{good}$ and $C_i^{bad}$ to record the numbers of received packets
along that path with correct and incorrect k-bit MACs, respectively, in the current epoch. Such
a scheme considerably reduces communication overhead compared to attaching entire MACs while
retaining high security assurance and communication throughput, as shown later.
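
The per-router behavior described above (verify a k-bit MAC, update two counters, forward only authentic packets) can be sketched as follows. This is a minimal illustration under assumptions: the PRF is instantiated here with HMAC-SHA256 truncated to k bits, which is a choice made for this sketch, and the names `kbit_mac` and `PathCounters` are hypothetical.

```python
import hashlib
import hmac


def kbit_mac(key: bytes, packet: bytes, k: int = 2) -> int:
    """Return a k-bit tag (0 <= tag < 2**k) derived from a keyed PRF.

    Assumption: HMAC-SHA256 stands in for the PRF of the text.
    """
    if not 1 <= k <= 8:
        raise ValueError("this sketch supports 1 <= k <= 8")
    digest = hmac.new(key, packet, hashlib.sha256).digest()
    return digest[0] >> (8 - k)  # keep the top k bits of the first byte


class PathCounters:
    """Per-path counters a router keeps for the current epoch."""

    def __init__(self) -> None:
        self.good = 0  # packets whose k-bit MAC verified
        self.bad = 0   # packets whose k-bit MAC did not verify

    def receive(self, key: bytes, packet: bytes, tag: int, k: int = 2) -> bool:
        """Verify the tag, update the counters, and say whether to forward."""
        if tag == kbit_mac(key, packet, k):
            self.good += 1
            return True
        self.bad += 1
        return False
```

With k = 2 a forged tag passes with probability 1/4, so a single forgery often goes unnoticed, but many forgeries shift the `bad` counter visibly.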


5.2.2  ShortMAC  Example 

We  present  a  toy  example  in  Figure  5.1  to  provide  intuition  on  how  ShortMAC  enables  data-plane 
fault  localization.  Suppose  the  source  node  sends  out  1000  packets  in  a  certain  epoch.  The  source 
uses a PRF taking a secret key as input that maps a packet into two bits (called a 2-bit MAC)
which appear uniformly random to anyone without knowledge of the secret key. The source computes the PRF
four times for each packet, taking as input the epoch symmetric key shared with $f_1$, $f_2$, $f_3$, and the
destination, respectively. Then the source attaches the resulting four 2-bit MACs to each packet.

Figure 5.1: Fault localization example with ShortMAC using 2-bit MAC. $f_2$ is malicious.

Among the 1000 packets, suppose three packets are spontaneously dropped on the first link, and
router $f_1$ receives the remaining 997 packets. $f_1$ computes the PRF on each of the received packets
taking as input the epoch symmetric key shared with the source, and compares each resulting
2-bit MAC with the one embedded in the packet. All verifications are successful, so $f_1$ has
$C_1^{good} = 997$ and $C_1^{bad} = 0$. Suppose the malicious router $f_2$ drops 100 good packets and injects
100 malicious packets. For each injected packet, $f_2$ needs to forge 2-bit MACs for both $f_3$ and
the destination that “authenticate” the fabricated data content. However, since $f_2$ does not know
the corresponding epoch symmetric keys of $f_3$ and the destination, $f_2$ can only guess the 2-bit
MACs for its injected packets. Since the 2-bit MACs produced by the PRF are indistinguishable
from random bits, $f_2$ can correctly guess each 2-bit MAC with probability $\frac{1}{4}$. Since $f_2$ must
guess two correct MACs, each forged packet will be accepted by the destination with probability
$\frac{1}{16}$. Suppose next that 26 of the 100 2-bit MACs that $f_2$ forged for $f_3$ happen to be valid with
respect to the malicious data content. $f_3$ thus computes $C_3^{bad} = 100 - 26 = 74$ and $C_3^{good} =
997 - 100$ (dropped legitimate packets) $+\ 26$ (bogus but undetected packets) $= 923$. Similarly, we
can analyze the counters for the destination in Figure 5.1, assuming 7 out of the 26 received bogus
packets happen to be consistent with their 2-bit MACs at the destination.
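
The counter arithmetic of this toy example can be replayed step by step. The sketch below only restates the numbers given in the text; the destination's counters additionally assume that no packet is lost naturally on the links after $f_1$, which is an assumption of this sketch. The function name `toy_example_counters` is hypothetical.

```python
def toy_example_counters() -> dict:
    """Replay the toy example's counters from the numbers stated in the text."""
    sent = 1000
    lost_on_first_link = 3
    f1_good = sent - lost_on_first_link            # 997 packets verified at f_1
    f1_bad = 0

    dropped_by_f2 = 100                            # legitimate packets dropped
    injected_by_f2 = 100                           # bogus packets injected
    bogus_accepted_by_f3 = 26                      # forged MACs that happen to verify at f_3

    f3_bad = injected_by_f2 - bogus_accepted_by_f3
    f3_good = f1_good - dropped_by_f2 + bogus_accepted_by_f3

    # f_3 forwards only packets it accepted. Assumption of this sketch:
    # no natural loss on the link between f_3 and the destination.
    bogus_accepted_by_dest = 7
    dest_bad = bogus_accepted_by_f3 - bogus_accepted_by_dest
    dest_good = (f1_good - dropped_by_f2) + bogus_accepted_by_dest

    return {
        "f1": (f1_good, f1_bad),
        "f3": (f3_good, f3_bad),
        "dest": (dest_good, dest_bad),
    }
```

Under these assumptions the destination ends with 904 good and 19 bad packets, so both downstream nodes report non-zero bad counters even though only a fraction of the forgeries were caught at each hop.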

5.2.3 Fault Localization and Guaranteed $\theta$

At the end of each epoch, routers and the destination report their counter values to the source using
a secure transmission approach (detailed in Section 5.3). The source can identify excessive packet
drops between $f_m$ and $f_{m+1}$ if the $C^{good}$ value of $f_{m+1}$ is abnormally lower than that of $f_m$ based
on the drop detection threshold $T_{dr}$, which is carefully set based on the customized acceptable per-link
drop rate. Moreover, this scheme can successfully bound the total number of spurious packets
with fabricated k-bit MACs that the adversary can inject, because at least one of the downstream
recipient routers will detect the inconsistency of the k-bit MACs with a non-trivial probability, thus
having a non-zero $C^{bad}$ value. For example in Figure 5.1, although $f_2$ can claim any values for its
own counters, no matter what values $f_2$ claims, the source can notice excessive packet loss and a
large number of fake packets either between $f_1$ and $f_2$, or $f_2$ and $f_3$. Hence one of $f_2$’s malicious
links will be detected by the source.

Once the source S bypasses all malicious links identified by ShortMAC, S can find a working
path with no excessive packet corruption at any link, thus achieving a guaranteed successful forwarding
rate $\theta$. With secure fault localization, a source can find a working path after exploring
a number of paths at most linear in $\Omega$, where $\Omega$ is the number of malicious links in the network. In contrast, with only
end-to-end path monitoring, a source may explore a number of paths exponential in $\Omega$, as we showed
in Section 1.1.


5.3  ShortMAC  Details 

In this section we describe the ShortMAC protocol in detail, where the source can either guarantee
that a high fraction $\theta$ of its data has been correctly forwarded if no malicious activities are detected,
or can bypass the faulty links and find a working path after exploring a number of paths linear in
the number of faulty links.¹ In the following, we first formalize the ShortMAC packet format and
then detail the protocol.

5.3.1  ShortMAC  Packet  Format 

A source node S adds a trailer to each data packet it sends:

(5.1)  $\text{trailer} = (SN, M_1, \ldots, M_d)$,

where SN is a per-path sequence number that makes each packet unique along the same path to prevent
packet replay attacks, and $M_i$ denotes the k-bit MAC computed for $f_i$, which is constructed in a
recursive way starting from $f_d$:

(5.2)  $M_d \leftarrow \mathrm{PRF}_{K_{sd}}(IP_{invar} \,\|\, SN \,\|\, TTL_d)$

       $M_{d-1} \leftarrow \mathrm{PRF}_{K_{s(d-1)}}(IP_{invar} \,\|\, SN \,\|\, TTL_{d-1} \,\|\, M_d)$

       $\vdots$

       $M_i \leftarrow \mathrm{PRF}_{K_{si}}(IP_{invar} \,\|\, SN \,\|\, TTL_i \,\|\, M_{i+1} \,\|\, \ldots \,\|\, M_d)$

¹ Recall that forwarding fault localization protocols can only identify faulty links, rather than identifying
the nodes [21]. However, given that a malicious node has a limited degree, after bypassing all its malicious links the
source can eventually bypass that node.

Figure 5.2: Illustration of framing attacks. $f_1$ is malicious. (Figure annotations: $f_1$ maliciously modifies $M_3$, so $f_3$ detects a bad $M_3$ and increases $C_3^{bad}$; the TTL is decremented from 2 to 1 and then from 1 to 0, and the packet is dropped due to TTL = 0 before reaching the destination.)


where “||” denotes concatenation and $\mathrm{PRF}_{K_{si}}(\cdot)$ denotes a PRF keyed by the symmetric key $K_{si}$
shared between S and $f_i$. As previously discussed, the output of this PRF can be guessed correctly
with probability no larger than $\frac{1}{2^k}$ by anyone without the secret key $K_{si}$ [67]. In addition,


1) $IP_{invar}$ denotes the invariant portion of the original IP packet that should not be changed
at each router during forwarding, including the packet payload and IP headers excluding variable
fields such as TTL, RecordRoute IP option, Timestamp IP option, etc. If these invariant fields
are unexpectedly changed during forwarding, each downstream router can detect the inconsistency
between the (modified) packet and the embedded k-bit MAC with a non-trivial probability $1 - \frac{1}{2^k}$ and
thus increase its $C^{bad}$ counter.


2) $TTL_i$ denotes the expected TTL value at router $f_i$. Without authenticating this field in the
k-bit MAC, a malicious router can strategically lower the TTL field to cause a packet drop at a
remote downstream router due to a zero TTL value, thus performing framing attacks. For example
in Figure 5.2, if $M_i$ in Eq. (5.2) had not authenticated the TTL field, $f_1$ could maliciously change
the TTL value in the packets to 2, instead of decrementing it by 1. This causes the packets to be
dropped at $f_3$, thus framing the link between $f_2$ and $f_3$.

3) $M_i$ also authenticates the downstream $M_{i+1}, \ldots, M_d$, so that if a malicious router $f_m$ changes
any of these downstream k-bit MACs, $f_i$ can observe the inconsistency in $M_i$ with a probability
$1 - \frac{1}{2^k}$ and increase its $C^{bad}$ value. Otherwise, the protocol is vulnerable to framing attacks. For
example in Figure 5.2, if $M_i$ in Eq. (5.2) had not authenticated the downstream k-bit MAC fields,
$f_1$ could maliciously modify $M_3$ in the packets, which causes $f_3$ to detect an inconsistent $M_3$ with a
non-trivial probability and increase $C_3^{bad}$, thus framing the link between $f_2$ and $f_3$.
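
The recursive construction of the trailer MACs and the check performed at each hop can be sketched as follows. This is a hedged illustration, not the implementation evaluated in this chapter: the PRF is instantiated with truncated HMAC-SHA256, the byte encoding of the concatenation (4-byte sequence number, 1-byte TTL, one byte per downstream MAC) is an assumption of the sketch, and the names `prf_k`, `build_macs`, and `verify_at` are hypothetical. Index `i` is 0-based, so `macs[i]` corresponds to $M_{i+1}$ in the text.

```python
import hashlib
import hmac
from typing import List, Sequence


def prf_k(key: bytes, data: bytes, k: int = 2) -> int:
    """k-bit PRF output (assumed: HMAC-SHA256 truncated to its top k bits, k <= 8)."""
    digest = hmac.new(key, data, hashlib.sha256).digest()
    return digest[0] >> (8 - k)


def _mac_input(ip_invar: bytes, sn: int, ttl: int, downstream: Sequence[int]) -> bytes:
    # IP_invar || SN || TTL_i || M_{i+1} || ... || M_d (encoding is an assumption)
    return ip_invar + sn.to_bytes(4, "big") + bytes([ttl]) + bytes(downstream)


def build_macs(keys: List[bytes], ip_invar: bytes, sn: int,
               ttls: List[int], k: int = 2) -> List[int]:
    """Source side: compute M_d first, then M_{d-1}, ..., down to M_1."""
    d = len(keys)
    macs = [0] * d
    for i in range(d - 1, -1, -1):
        macs[i] = prf_k(keys[i], _mac_input(ip_invar, sn, ttls[i], macs[i + 1:]), k)
    return macs


def verify_at(i: int, key: bytes, ip_invar: bytes, sn: int, ttl: int,
              macs: List[int], k: int = 2) -> bool:
    """Router side: recompute its own MAC over the packet and the downstream MACs."""
    return macs[i] == prf_k(key, _mac_input(ip_invar, sn, ttl, macs[i + 1:]), k)
```

Because each MAC covers the expected TTL and all downstream MACs, an upstream change to either is noticed by the first benign router after the modification with probability $1 - \frac{1}{2^k}$, rather than surfacing several hops later.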

5.3.2 Protocol Details

Formally,  ShortMAC  consists  of  Request,  Report,  Identify,  Bypass  and  Send  stages,  described  as 
follows. 

Stage 1: Request with hop-by-hop reliable transmission

At the end of each epoch (i.e., after sending every N data packets), the source S will send a
request packet, denoted by request = (S, p), along the path $p = (f_1, \ldots, f_d)$ used in epoch $e_k$.
This request asks each router $f_i$ and the destination $f_d$ to report their counter values ($C_i^{bad}$ and
$C_i^{good}$) along the reverse of path p. Then S expects these counter reports in Acknowledgment (ACK)
packets from all the nodes in p containing the requested information authenticated with each node’s
$K_{si}$.

Note that a spontaneous loss of request or ACK packets will prevent S from learning the
counter values reported by certain routers in the previous epoch. To preclude such damage, we use the
following hop-by-hop reliable transmission approach: when $f_i$ forwards either a request or
an ACK packet to its neighbor, $f_i$ tries up to r times (e.g., r = 5) until it gets a confirmation from
the neighbor. In this way, the failure to receive a request or ACK packet indicates
malicious drops with probability $1 - \rho^r$, where $\rho$ is the natural loss rate
of a link. Then thanks to the Onion ACK approach presented below, the source can immediately
identify a malicious link that drops or modifies request or ACK packets; hence the request
packets do not need to be authenticated by the source as we show below.
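
To give a sense of scale for the retransmission argument above, the following sketch computes the probability that all r attempts are lost on a benign link. It assumes that attempts are lost independently and that confirmations themselves are not lost; the loss rate 0.01 used in the usage note is an illustrative value, not one taken from this chapter, and the function names are hypothetical.

```python
def benign_failure_probability(rho: float, r: int) -> float:
    """Probability that all r independent attempts are lost on a benign link.

    rho is the natural per-link loss rate; independence is an assumption.
    """
    return rho ** r


def malicious_drop_confidence(rho: float, r: int) -> float:
    """Probability that a missing request/ACK is not explained by natural loss."""
    return 1.0 - benign_failure_probability(rho, r)
```

For instance, with a natural loss rate of 0.01 and r = 5, the chance that a benign link loses every attempt is about $10^{-10}$, so a missing request or ACK is almost certainly due to a malicious drop.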

Stage  2:  Report  with  Onion  ACK 

Upon receiving a request, $f_i$ starts a timer whose value is the maximum round trip time from $f_i$
to the destination.² At the same time, $f_i$ constructs its local report $\mathcal{R}_i$:

(5.3)  $\mathcal{R}_i = (f_i, p, C_i^{good}, C_i^{bad})$,

where $f_i$ is the node id, p is the requested path, and $C_i^{good}$ and $C_i^{bad}$ are the counter values from
the previous epoch. Each router finds $C_i^{good}$ and $C_i^{bad}$ corresponding to path p based on the source
and destination IDs in p (assuming single path routing). Once the report is constructed:

² We can expect a reasonable upper bound of link latency in benign cases, which can be used to compute the
maximum round trip time according to the hop count from $f_i$ to the destination. Avramopoulos et al. [17] first
introduced the use of such a timer.

Case 1. If $f_i$ receives an ACK $A_{i+1}$ from neighbor $f_{i+1}$ before the timer expires, $f_i$ further commits
$\mathcal{R}_i$ into a new ACK $A_i$ by combining it with the received $A_{i+1}$ via an Onion ACK approach:

(5.4)  $A_i = (\mathcal{R}_i, A_{i+1}, \mathrm{MAC}_{K_{si}}(\mathcal{R}_i \,\|\, A_{i+1}))$,

where $\mathrm{MAC}_{K_{si}}(\cdot)$ denotes a message authentication code computed with $K_{si}$. Then, $f_i$ forwards $A_i$ to
$f_{i-1}$ toward S.


Case 2. If $f_i$ receives no ACK packet from $f_{i+1}$ before the timer expires, $f_i$ will initiate a new
ACK with its local report and send it to $f_{i-1}$.

The Onion ACK prevents the adversary from selectively dropping the request or the reports
of a certain router $f_i$ and framing a benign link [97]. In Onion ACK, all the reports are combined
and authenticated in one ACK packet at each hop so that a malicious node can only drop or modify
the onion report from its immediate neighbors. Intuitively, if $f_m$ drops or modifies the received
request or Onion ACK, the source can receive the correct reports from $f_1, \ldots, f_{m-1}$ but not from
$f_m, \ldots, f_d$; hence one of $f_m$’s links will be pinpointed by the source node in the identify stage
described below.

After sending the local reports, each router $f_i$ resets $C_i^{good}$ and $C_i^{bad}$ to zero, to be used for the next epoch along path $p$ (if $p$ is still used).

Stage  3:  Identify 

Upon receiving an Onion ACK $A_1$ from $f_1$, $S$ first iteratively retrieves $A_1, A_2, \ldots$ in order, until it either completes at $d$ or fails at $j$ ($j \le d$). $S$ can verify whether a retrieved report $R_i$ is valid by checking the embedded message integrity code $MAC_{K_{si}}(R_i \,\|\, A_{i+1})$. When the check fails at $j$ ($j \le d$), $S$ will immediately identify $l_j$ as faulty due to the use of reliable hop-by-hop transmission and Onion ACK. For example, if $S$ receives no report, it will identify $l_1$ as faulty ($j = 1$).
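
The wrapping in Eq. (5.4) and the source-side unwrapping can be sketched as follows. This is only an illustrative sketch: the fixed report layout, the use of HMAC-SHA256 as the MAC, and the function names `encode_report`, `wrap_ack`, and `peel_acks` are assumptions made for the example and are not taken from the ShortMAC specification.

```python
import hashlib
import hmac
import struct

MAC_LEN = 32                     # HMAC-SHA256 tag length (illustrative choice)
REPORT_FMT = "!HIIH"             # hypothetical layout: node id, path id, C_good, C_bad
REPORT_LEN = struct.calcsize(REPORT_FMT)


def encode_report(node_id: int, path_id: int, c_good: int, c_bad: int) -> bytes:
    """Serialize a local report R_i = (f_i, p, C_good, C_bad)."""
    return struct.pack(REPORT_FMT, node_id, path_id, c_good, c_bad)


def wrap_ack(key: bytes, report: bytes, inner_ack: bytes = b"") -> bytes:
    """Build A_i = R_i || A_{i+1} || MAC_K(R_i || A_{i+1}).

    An empty inner_ack models Case 2 (timer expired, start a new ACK) and
    also the destination, which has no downstream neighbor.
    """
    body = report + inner_ack
    return body + hmac.new(key, body, hashlib.sha256).digest()


def peel_acks(keys: list[bytes], ack: bytes):
    """Source side: unwrap the layers in order.

    keys[0] is the key shared with f_1, keys[1] with f_2, and so on.
    Returns (reports, fail_index); fail_index is None if all layers verify,
    otherwise the 1-based index j of the first missing or invalid layer.
    """
    reports = []
    for j, key in enumerate(keys, start=1):
        if len(ack) < REPORT_LEN + MAC_LEN:
            return reports, j                      # layer j is missing
        body, tag = ack[:-MAC_LEN], ack[-MAC_LEN:]
        expected = hmac.new(key, body, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            return reports, j                      # layer j was modified
        reports.append(struct.unpack(REPORT_FMT, body[:REPORT_LEN]))
        ack = body[REPORT_LEN:]                    # continue with A_{j+1}
    return reports, None
```

In this sketch, a router that never hears from its downstream neighbor calls `wrap_ack` with no inner ACK, so the source recovers the reports up to that router and stops at the next index.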


In addition, $S$ extracts $R_1, \ldots, R_j$ in turn, which include the $C_i^{bad}$ and $C_i^{good}$ values. A non-zero $C_i^{bad}$ implies the existence of malicious packet injection between $f_i$ and $S$. However, $S$ cannot blame $l_i$ simply whenever $C_i^{bad} > 0$, say, $C_i^{bad} = 1$. A possible scenario is that a malicious node $f_{i-2}$ injects a fake packet, but the $k$-bit MAC intended for $f_{i-1}$ “happens” to be consistent with the fake packet at benign node $f_{i-1}$ (e.g., when $k = 2$, this can happen with probability 0.25). In this case, $f_{i-1}$ will forward the fake packet, which $f_i$ may detect and thus increase $C_i^{bad}$. Similarly, due to natural packet loss, $S$ cannot simply accuse link $l_i$ when $C_i^{good} < C_{i-1}^{good}$. Therefore, we leverage two detection thresholds $T_{in}$ and $T_{dr}$, where $T_{in}$ is the injection detection threshold for the number of injected packets on each link, and $T_{dr}$ is the drop detection threshold for the fraction of dropped packets on each link. As we will show in Section 5.5, these thresholds reduce false positives while limiting the adversary’s ability to corrupt packets and ensuring a lower bound on the successful packet forwarding rate. The detection thresholds are used in two detection procedures:

1) check-injection: $S$ checks the extracted $C_1^{bad}, C_2^{bad}, \ldots, C_j^{bad}$ values in order. If $C_i^{bad} > T_{in}$ for some $i$, then $S$ identifies $l_i$ as faulty and the check-injection procedure stops.

2) check-dropping: If no fault is detected by check-injection, $S$ further checks the extracted $C_1^{good}, C_2^{good}, \ldots, C_j^{good}$ values in order. If $C_i^{good} < (1 - T_{dr}) \cdot C_{i-1}^{good}$ (with $C_0^{good} = N$) holds for some $i$, then $S$ identifies $l_i$ as faulty and the check-dropping procedure terminates.
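
The two procedures can be sketched in a few lines. The function name `identify_faulty_link` and the list-based inputs are hypothetical; the returned index follows the indexing used in the two procedures above.

```python
def identify_faulty_link(c_good, c_bad, n_sent, t_in, t_dr):
    """Return the index i flagged as faulty, or None if no check fires.

    c_good[i-1] and c_bad[i-1] are the counters reported by f_i (i = 1..j).
    n_sent is N, the number of packets the source sent in the epoch,
    which plays the role of C_0^good.
    """
    # check-injection: first node whose bad counter exceeds T_in
    for i, bad in enumerate(c_bad, start=1):
        if bad > t_in:
            return i
    # check-dropping: compare each good counter with its upstream neighbor's
    prev = n_sent
    for i, good in enumerate(c_good, start=1):
        if good < (1 - t_dr) * prev:
            return i
        prev = good
    return None
```

For example, with `n_sent=1000`, `t_dr=0.015`, and reported good counters `[995, 990, 940, 936]`, the third entry is the first to fall below 98.5% of its predecessor, so index 3 is returned.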

Stage  4:  Bypass  and  Send 

If Stage 3 outputs any malicious link $l_m$, $S$ selects a new path excluding the previously detected malicious links and sends its packets with the ShortMAC authentication shown in Eq. (5.2). Each node $f_i$ examines its corresponding $k$-bit MAC $M_i$ in each packet and increments $C_i^{good}$ or $C_i^{bad}$ accordingly. In addition, each router remembers the last seen per-path SN embedded in the packets, as shown in Eq. (5.1), and discards packets with older SNs on that path.
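
A rough sketch of the per-packet work at a router follows. The MAC construction here (the top k bits of HMAC-SHA256 over SN and payload) is a stand-in chosen for the example and is not the construction of Eq. (5.2); `short_mac` and `PathState` are hypothetical names. The sketch also assumes that the SN is recorded before the MAC is checked, and that a packet whose SN was already seen is discarded without touching the counters.

```python
import hashlib
import hmac
from dataclasses import dataclass


def short_mac(key: bytes, sn: int, payload: bytes, k: int = 2) -> int:
    """Hypothetical k-bit MAC (k <= 8): top k bits of HMAC-SHA256(SN || payload)."""
    digest = hmac.new(key, sn.to_bytes(8, "big") + payload, hashlib.sha256).digest()
    return digest[0] >> (8 - k)


@dataclass
class PathState:
    """Per-path state kept by one router: key, two counters, last seen SN."""
    key: bytes
    k: int = 2
    c_good: int = 0
    c_bad: int = 0
    last_sn: int = -1

    def process(self, sn: int, payload: bytes, mac: int) -> bool:
        """Return True if the packet should be forwarded."""
        if sn <= self.last_sn:
            return False              # old or repeated SN: discard silently
        self.last_sn = sn             # assumption: record SN before the MAC check
        if mac == short_mac(self.key, sn, payload, self.k):
            self.c_good += 1
            return True
        self.c_bad += 1               # inconsistent k-bit MAC
        return False
```

Recording the SN first means that a second copy of the same packet carrying a different k-bit MAC is rejected on its SN alone.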

5.4  Security  Analysis 

This section discusses ShortMAC’s security against data-plane attacks by malicious routers. Section 5.5 provides theoretical proofs on ShortMAC’s security. In our adversary model, a malicious router can drop and inject data packets, requests, and ACKs, and can send arbitrary counter values in its reports. We show that ShortMAC is secure against a single malicious router (say, $f_m$) as well as multiple colluding nodes.

Corrupting data packets. Dropping legitimate data packets by $f_m$ will cause a discrepancy of the counter values between $f_m$ and its neighbors. For example, if $f_m$ correctly reports $C_m^{good}$, then $C_m^{good} - C_{m+1}^{good}$ will exhibit a large discrepancy; if $f_m$ reports a lower $C_m^{good}$, then $C_{m-1}^{good} - C_m^{good}$ will exhibit a large discrepancy. Hence, either $l_{m-1}$ or $l_m$ will become suspicious. Moreover, if $f_m$ injects/modifies packets, $M_{m+1}$ will be inconsistent at $f_{m+1}$ with high probability and cause a non-zero $C_{m+1}^{bad}$. Hence, both dropping and injection attacks can be detected as long as the source can learn the correct counter values in the ACK packets sent by the nodes between $f_m$ and the destination, which is described next.

Corrupting ACKs or requests. Since the requests are not authenticated by $S$, $f_m$ can modify the content of requests (such as the source ID and the path); however, this will result in $S$ failing to receive the correct counter reports from $f_{m+1}$ (or $f_m$), $\ldots$, $f_d$ in $p$, thus causing $l_{m+1}$ or $l_m$ to be detected. $f_m$ cannot selectively drop the ACK reports due to the use of Onion ACK. Instead, $f_m$ can only drop the ACKs or requests from its immediate neighbors, which will again harm its incident links.

Replay, reorder, and traffic analysis attacks. To prevent replay and reorder attacks, each packet contains a per-path sequence number SN as in Eq. (5.1), and each router discards packets with older SNs. Hence, replayed and reordered packets will be dropped at the next-hop benign node without influencing the counter values of benign nodes. Note that ShortMAC runs on a per-path basis and the SN is a per-path sequence number, which provides natural isolation across different paths; in benign cases, packets along the same path are expected to maintain the same order during forwarding as they were sent by the source. On the other hand, if $f_m$ falsely reports a large SN, $f_{m+1}$ will drop the subsequent packets and $l_m$ will be identified as malicious due to its high packet drop rate. Moreover, the per-path SN protects ShortMAC against traffic analysis attacks, where $f_m$ attempts to find out the correct $k$-bit MAC of a packet $m$ by re-sending $m$ with different $k$-bit MACs and observing whether the next hop $f_{m+1}$ forwards the packet. Such traffic analysis is ineffective because each packet is unique due to the per-path SN and $f_{m+1}$ can detect packets with the same SN; thus $f_m$ cannot send the same packet $m$ with only the $k$-bit MAC changed.

DoS attacks. A malicious router $f_m$ may launch bandwidth Denial-of-Service (DoS) attacks by generating an excessive number of packets. However, this attack can be reduced to a packet injection attack and will be reflected by the $C^{bad}$ counters. A malicious router may also attempt to open many bogus flows with spoofed sources to exhaust other routers’ state. We can borrow existing work to provide source accountability and reliable flow/path identification [12, 92]. Also note that our adversary model considers malicious routers that threaten the communication between benign hosts. We do not consider DDoS attacks launched by malicious hosts (botnets), which other researchers have worked to defend against [59, 92, 61]. Hence, in our problem setting, a link that exhibits a high loss rate because it is under a DDoS attack is simply considered a faulty link. Meanwhile, the path setup phase in ShortMAC can be naturally integrated with capability schemes [92] for DDoS limiting, and the per-path counters may also be used for per-path rate limiting.

Collusion  attacks.  Each  of  the  colluding  routers  can  commit  any  of  the  misbehavior  discussed 
above.  We  can  prove  by  induction  that  in  any  case,  one  of  the  malicious  links  of  one  of  the  colluding 
nodes  is  guaranteed  to  be  detected.  A  proof  sketch  is  given  below. 

Consider the base case where two nodes $f_m$ and $f_{m'}$ ($m < m'$) collude. Without loss of generality:


Figure 5.3: Security against colluding nodes - one base case with two adjacent colluding nodes $f_m$ and $f_{m+1}$ forming a virtual malicious node $F_m$. (The figure shows the path from the source $S$ through $f_{m-1}$, $f_m$, and $f_{m+1}$ to the destination $f_d$, connected by links $l_{m-1}$, $l_m$, and $l_{m+1}$.)

1) When $f_m$ and $f_{m'}$ are not adjacent (i.e., $m' > m + 1$), the security analysis in Section 5.4 applies to $f_m$, and one of $f_m$’s malicious links will become suspicious if $f_m$ misbehaves. This is because if $f_m$ commits the above attacks, such misbehavior will be reflected in the counters of the benign neighbor $f_{m+1}$, which cannot be biased by $f_{m'}$.

2) When $f_m$ and $f_{m'}$ are adjacent ($m' = m + 1$), these two nodes can be regarded as one single “virtual” malicious node $F_m$ with neighbors $f_{m-1}$ and $f_{m+2}$, as shown in Figure 5.3. (i) If $f_m$ or $f_{m+1}$ drops packets, a discrepancy will exist between $C_{m-1}^{good}$ and $C_{m+2}^{good}$ no matter what values of $C_m^{good}$ and $C_{m+1}^{good}$ $F_m$ claims. (ii) If $f_m$ or $f_{m+1}$ injects packets, $C_{m+2}^{bad}$ will become non-zero and make $l_{m+1}$ suspicious. In any case, an adjacent link of $F_m$ (a malicious link) will become suspicious.

In  the  general  case  with  n  colluding  nodes,  we  can  first  group  adjacent  colluding  nodes  into 
virtual  malicious  nodes  as  in  Figure  5.3,  resulting  in  non-adjacent  malicious  nodes  (including  virtual 
malicious  nodes).  Then  we  can  show  non-adjacent  malicious  nodes  can  be  detected  based  on  the 
above  analysis. 

Although colluding attackers cannot corrupt more packets on any single link than the thresholds allow an individual attacker, they can choose to distribute packet dropping across multiple links. In this case, the total packet drop rate by colluding attackers increases linearly with the number of malicious links in the same path, yet remains bounded, as analyzed in Section 5.5.


5.5  Theoretical  Results  and  Comparison 

We prove the $(N, \delta)$-data-plane fault localization (Definition 3) and $(\alpha, \beta)_\delta$-forwarding security of ShortMAC (Definition 5), which in turn yield the $\theta$-guaranteed forwarding correctness (Definition 4). Proofs of the lemmas and theorems are provided in Appendix B.

Comparison  of  theoretical  results.  Before  presenting  the  theorems,  we  first  summarize  and 
compare  ShortMAC  theoretical  results  with  two  recent  proposals,  PAAI-1  [97]  and  Stat.  FL  [21] 
(including  two  approaches  denoted  by  SSS  and  sketch).  Table  5.1  presents  the  numeric  figures 
using  an  example  parameter  setting  for  intuitive  illustration,  while  ShortMAC  presents  similarly 
distinct  advantages  in  other  parameter  settings.  In  this  example  scenario  shown  in  the  table, 

| Protocol | ShortMAC | PAAI-1 | SSS | Sketch |
|---|---|---|---|---|
| Detect. Delay (pkt) | 3.8 × 10^4 | 7.1 × 10^5 | 1.6 × 10^8 | ≈ 10^6 |
| Comm. (extra %) | < 10^-5 | 1 | 1 | < 10^-5 |
| Marking Cost (bytes) | 2 | 0 | 0 | 0 |
| Per-path State (bytes) | 21 | 2 × 10^5 | 4 × 10^8 | ≈ 500 |

Table 5.1: Theoretical comparison with PAAI-1 [97] and Stat. FL [21] (including two approaches, SSS and sketch). Note that the details of sketch are not provided in the published paper [21], and the full version of [21] does not present explicit bounds on detection delay. The above figures for sketch are estimated from their earlier work [36]. In this example scenario, $d = 5$, $\delta = 1\%$, $p = 0.5\%$, $T_{dr} = 1.5\%$, a symmetric key is 16 bytes, and ShortMAC uses 2-bit MACs. PAAI-1 specific parameters include the “packet sampling rate” set to 0.01, the end-to-end latency set to 25 ms, the source’s sending rate set to $10^6$ packets per second, and a packet hash size of 128 bits.


the guaranteed data-plane packet delivery ratio is $\theta = 92\%$. The communication overhead for a router in ShortMAC is 1 extra ACK for every $3.8 \times 10^4$ data packets in an epoch; the marking cost is 10 bits for the 2-bit MACs in a path with 5 hops, and the per-path state at each router is 21 bytes (16-byte symmetric key, 2-byte $C^{good}$, 1-byte $C^{bad}$, and 2-byte per-path SN). Though Barak et al. proved the necessity of per-path state for a secure fault localization protocol [21], such a minimal per-path state in ShortMAC is viable for both intra-domain networks with tens of thousands of routers and the Internet AS-level routing among currently tens of thousands of ASes.

We provide the intuition for ShortMAC’s distinct advantages. PAAI-1 and Stat. FL use either low-rate packet sampling or approximation techniques for packet fingerprinting, both of which waste entropy contained in certain packet transmissions, thus resulting in long detection delay (e.g., the transmission results of non-sampled packets do not contribute to the detection phase). In contrast, ShortMAC counts every packet transmission, thus achieving much faster detection. In addition, secure packet sampling requires additional packet buffering [97], and packet fingerprinting takes considerable memory [21].


Lemma 13. Injection Detection: Given the bound $\delta$ on detection false negative and false positive rates, the injection detection threshold $T_{in}$ can be set to

$T_{in} = \frac{2 \ln \frac{2d}{\delta}}{q^4}$,

where $d$ is the path length and $q = \frac{2^k - 1}{2^k}$ is the probability that a fake packet will be inconsistent with the associated $k$-bit MAC. The number of fake packets $\beta$ an adversary can inject on one of its malicious links without being detected is limited to:

(5.5) $\beta = \frac{T_{in}}{q} + \frac{\sqrt{\left(\ln \frac{2}{\delta}\right)^2 + 8 q T_{in} \ln \frac{2}{\delta}} + \ln \frac{2}{\delta}}{4 q^2}$
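
To get a feel for the magnitudes, the sketch below evaluates the threshold and bound of Lemma 13, read here as $T_{in} = 2\ln(2d/\delta)/q^4$ and $\beta = T_{in}/q + \big(\sqrt{\ln^2(2/\delta) + 8qT_{in}\ln(2/\delta)} + \ln(2/\delta)\big)/(4q^2)$. The function names are hypothetical. The last line of the usage checks that this $\beta$ satisfies $2(\beta q - T_{in})^2 = \beta \ln(2/\delta)$, which is the form one would obtain from a Hoeffding-style tail bound; that link to the proof in Appendix B is our assumption.

```python
import math


def mac_mismatch_prob(k: int) -> float:
    """q = (2^k - 1) / 2^k: chance that a fake packet fails a k-bit MAC check."""
    return (2**k - 1) / 2**k


def injection_threshold(d: int, delta: float, k: int) -> float:
    """T_in = 2 ln(2d / delta) / q^4."""
    q = mac_mismatch_prob(k)
    return 2 * math.log(2 * d / delta) / q**4


def max_undetected_injections(t_in: float, delta: float, k: int) -> float:
    """beta = T_in/q + (sqrt(L^2 + 8 q T_in L) + L) / (4 q^2), with L = ln(2/delta)."""
    q = mac_mismatch_prob(k)
    log_term = math.log(2 / delta)
    root = math.sqrt(log_term**2 + 8 * q * t_in * log_term)
    return t_in / q + (root + log_term) / (4 * q**2)


if __name__ == "__main__":
    d, delta, k = 5, 0.01, 2          # the example setting of Table 5.1
    q = mac_mismatch_prob(k)
    t_in = injection_threshold(d, delta, k)
    beta = max_undetected_injections(t_in, delta, k)
    print(f"q = {q}, T_in = {t_in:.1f}, beta = {beta:.1f}")
    print(math.isclose(2 * (beta * q - t_in) ** 2, beta * math.log(2 / delta)))
```

With 2-bit MACs, a 5-hop path, and $\delta = 1\%$, this gives a threshold of roughly 44 detected fake packets and a worst-case bound of roughly 77 injected packets per epoch.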

In Lemma 14, we derive $N$, the number of data packets a source needs to send in one epoch to bound the detection false positive and false negative rates below $\delta$. Due to natural packet loss, a network operator first sets an expectation based on her domain knowledge such that any benign link in normal condition should spontaneously drop less than a fraction $p$ of packets. We first describe how the drop detection threshold $T_{dr}$ is set when $N$ and $\delta$ are given. Intuitively, by sending more data packets (larger $N$), the observed per-link drop rate can approach its expected value, which is less than $p$, more closely; otherwise, with a smaller $N$, the observed per-link drop rate can deviate further from $p$, and the drop detection threshold $T_{dr}$ has to tolerate a larger deviation (thus being very loose) in order to limit the false positive rate below the given $\delta$. On the other hand, a small $N$ is desired for fast fault localization. We define Detection Delay to be the minimum value of $N$ given the required $\delta$.


Lemma 14. Dropping Detection and $(N, \delta)$-Data-Plane Fault Localization: Given the bound $\delta$ on detection false positive and negative rates and drop detection threshold $T_{dr}$, the detection delay $N$ is given by:

(5.6) $N = \frac{\ln\left(\frac{2d}{\delta}\right)}{2 (T_{dr} - p)^2 (1 - T_{dr})^d}$,

where $d$ is the path length. Correspondingly, the fraction of packets $\alpha$ an adversary can drop on one of its malicious links without being detected is limited to:

(5.7) $\alpha = 1 - (1 - T_{dr})^2 + \frac{\beta}{N (1 - T_{dr})^d}$.


In practice, $T_{dr}$ can be chosen according to the expected upper bound $p$ of a “reasonable” normal link loss rate such that a drop rate above $T_{dr}$ is regarded as “excessively lossy”.
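
The trade-off between $T_{dr}$ and the detection delay can be explored numerically. The sketch below evaluates Lemma 14 read as $N = \ln(2d/\delta) / \big(2 (T_{dr} - p)^2 (1 - T_{dr})^d\big)$ and $\alpha = 1 - (1 - T_{dr})^2 + \beta / \big(N (1 - T_{dr})^d\big)$; the function names are hypothetical, and $\beta$ is passed in as a plain number.

```python
import math


def detection_delay(t_dr: float, p: float, d: int, delta: float) -> float:
    """N = ln(2d/delta) / (2 (T_dr - p)^2 (1 - T_dr)^d)."""
    if t_dr <= p:
        raise ValueError("T_dr must exceed the natural loss rate p")
    return math.log(2 * d / delta) / (2 * (t_dr - p) ** 2 * (1 - t_dr) ** d)


def max_undetected_drop_fraction(t_dr: float, d: int, n: float, beta: float) -> float:
    """alpha = 1 - (1 - T_dr)^2 + beta / (N (1 - T_dr)^d)."""
    return 1 - (1 - t_dr) ** 2 + beta / (n * (1 - t_dr) ** d)


if __name__ == "__main__":
    p, d, delta = 0.005, 5, 0.01
    for t_dr in (0.0075, 0.01, 0.015, 0.02):
        n = detection_delay(t_dr, p, d, delta)
        print(f"T_dr = {t_dr:.4f}  ->  N = {n:,.0f} packets")
```

For $T_{dr} = 1.5\%$ this evaluates to about $3.7 \times 10^4$ packets, close to the ShortMAC entry in Table 5.1, and the delay grows quickly as $T_{dr}$ approaches $p$.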


Theorem 15. Forwarding Security and Correctness: Given $T_{dr}$, $\delta$, and path length $d$, we can achieve $(\alpha, \beta)_\delta$-forwarding security, where $\alpha$ is given by Lemma 14 and $\beta$ is given by Lemma 13. We also achieve $(\Omega, \theta)$-guaranteed forwarding correctness with $\Omega$ equal to the number of malicious links in the network, and

(5.8) $\theta = (1 - T_{dr})^d - \frac{\beta}{N}$,

where $N$ is derived from Lemma 14.
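
As a worked instance of Eq. (5.8), read as $\theta = (1 - T_{dr})^d - \beta/N$, take the example setting of Table 5.1 ($d = 5$, $T_{dr} = 1.5\%$). The values $\beta \approx 77$ and $N \approx 3.7 \times 10^4$ used below are our own evaluation of Lemmas 13 and 14 for $\delta = 1\%$, $p = 0.5\%$, and 2-bit MACs, so they should be treated as approximate:

```latex
\begin{align*}
\theta &= (1 - T_{dr})^d - \frac{\beta}{N} \\
       &\approx (0.985)^5 - \frac{77}{3.7 \times 10^{4}} \\
       &\approx 0.927 - 0.002 \\
       &\approx 0.925
\end{align*}
```

This agrees with the guaranteed delivery ratio of about 92% quoted for the example scenario; the first term, set by the drop threshold and the path length, dominates.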

In  Theorem  16,  we  analyze  the  protocol  overhead  with  the  following  three  metrics  (we  further 
analyze  the  throughput  and  latency  in  Section  5.7  via  real-field  testing): 

1)  The  communication  overhead  is  the  fraction  of  extra  packets  each  router  needs  to  transmit. 

2)  The  marking  cost  is  the  number  of  extra  bits  a  source  needs  to  embed  into  each  data  packet. 

3) The per-path state is defined as the per-path extra bits that a router stores for the security protocol in fast memory needed for per-packet processing.3

Theorem 16. Overhead: For each router, the communication overhead is one packet for each epoch of $N$ data packets. The marking cost is $k \cdot d$ bits for the $k$-bit MACs, where $d$ is the path length. The per-path state comprises one $\lg N$-bit $C^{good}$ counter, one $\lg \beta$-bit $C^{bad}$ counter, one $\lg N$-bit last-seen per-path SN, and one epoch symmetric key.
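
The figures quoted for the example scenario can be reproduced with a short calculation. The helper names below are hypothetical, rounding each field up to whole bytes is an assumption made to match the 2-byte, 1-byte, and 2-byte fields mentioned earlier, and the value of $\beta$ is an illustrative input.

```python
import math


def bits_to_bytes(bits: int) -> int:
    """Round a bit count up to whole bytes."""
    return math.ceil(bits / 8)


def per_path_state_bytes(n: float, beta: float, key_bytes: int = 16) -> int:
    """Key + lg N-bit C_good + lg beta-bit C_bad + lg N-bit SN, byte-aligned."""
    good = bits_to_bytes(math.ceil(math.log2(n)))
    bad = bits_to_bytes(math.ceil(math.log2(beta)))
    sn = bits_to_bytes(math.ceil(math.log2(n)))
    return key_bytes + good + bad + sn


def marking_cost_bits(k: int, d: int) -> int:
    """k bits of MAC for each of the d hops."""
    return k * d


def communication_overhead(n: float) -> float:
    """One extra ACK per epoch of N data packets."""
    return 1 / n


if __name__ == "__main__":
    n, beta = 3.8e4, 77               # N from Table 5.1; beta is illustrative
    print(per_path_state_bytes(n, beta), "bytes of per-path state")
    print(marking_cost_bits(2, 5), "marking bits per packet")
    print(f"{communication_overhead(n):.1e} extra packets per data packet")
```

With these inputs the state comes to 21 bytes and the marking cost to 10 bits (2 bytes when byte-aligned), matching the ShortMAC column of Table 5.1.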


5.6  SSFNet-based  Evaluation 

In addition to analyzing the theoretical performance, we implement a ShortMAC prototype on the SSFNet simulator [6] to study the detection delay and security of ShortMAC. Section 5.7 further investigates ShortMAC’s throughput and latency. These experimental results provide average-case performance under various attack strategies, complementing the theoretical results, which are derived for the worst case (involving multiple mathematical relaxations such as the Hoeffding inequality) and assume constant dropping/injection rates.

3 The buffering space needed for the Onion-ACK construction of report messages in ShortMAC is not a major concern, as the Onion-ACK is computed only once every epoch and can be buffered in off-chip storage.

Evaluation scenario and attack pattern. Since ShortMAC provides a natural isolation across paths due to its per-path state, our evaluation focuses on a single path. Specifically, we present the result of a 6-hop path (routers $f_1$, $f_2$, $f_3$, $f_4$, $f_5$ and the destination $f_6$), since our experiments yield the same observations with other path lengths. We simulate both (i) an independent packet corruption pattern, where a malicious node drops/injects each packet independently with a certain drop/injection rate, and (ii) a random-period packet corruption pattern, where the benign (non-attack) period $T_b$ and the attack period $T_a$ (when the malicious node drops/modifies all legitimate packets) are activated in turns. The durations of both periods are randomly generated. For both attack patterns, we control the average packet drop/injection rates and observe that both attack patterns yield similar observations. Hence, in the following experiments, we only show the results for the independent packet corruption pattern. Also, we inject a natural packet loss rate $p$ on each link to simulate natural packet loss, which is not provided by SSFNet. As Section 5.4 elaborates on ShortMAC’s security against colluding attacks, we only show the representative results for a single malicious node $f_3$. For each simulation setting, we run the simulation 1000 times and present the average results.
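
The independent packet corruption pattern is easy to mimic outside SSFNet. The sketch below is not the SSFNet prototype: it is a plain Monte Carlo model with hypothetical function names, in which every hop loses a packet with probability `p`, the malicious node counts a packet and then drops it with probability `drop_rate`, and only the check-dropping procedure is applied. A run counts as a correct detection if the flagged index is the malicious node's own or that of its downstream neighbor.

```python
import random


def simulate_epoch(n, d, p, bad_node, drop_rate, rng):
    """Return the good counters C_1..C_d after the source sends n packets."""
    counts = [0] * d
    for _ in range(n):
        for i in range(1, d + 1):
            if rng.random() < p:
                break                      # natural loss on the hop into f_i
            counts[i - 1] += 1
            if i == bad_node and rng.random() < drop_rate:
                break                      # malicious drop after counting
    return counts


def check_dropping(counts, n, t_dr):
    """Index of the first counter below (1 - T_dr) times its predecessor."""
    prev = n
    for i, good in enumerate(counts, start=1):
        if good < (1 - t_dr) * prev:
            return i
        prev = good
    return None


def error_rates(trials, n, d=6, p=0.005, t_dr=0.01, bad_node=3,
                drop_rate=0.02, seed=0):
    """Estimate (false positive rate, false negative rate) over many epochs."""
    rng = random.Random(seed)
    fp = fn = 0
    for _ in range(trials):
        counts = simulate_epoch(n, d, p, bad_node, drop_rate, rng)
        flagged = check_dropping(counts, n, t_dr)
        hit = flagged in (bad_node, bad_node + 1)
        if flagged is not None and not hit:
            fp += 1                        # a hop away from the attacker was blamed
        if not hit:
            fn += 1                        # the attacker's links escaped this epoch
    return fp / trials, fn / trials


if __name__ == "__main__":
    for n in (250, 500, 1000, 2000):
        print(n, error_rates(trials=200, n=n))
```

Because every false positive is also counted as a miss of the real fault, the estimated false negative rate in this model can never fall below the false positive rate.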

Against various dropping attacks. Figure 5.4 depicts the detection delay $N$ and error rates $\delta$ with a per-link natural loss rate $p$ of 0.5%, a drop detection threshold $T_{dr}$ of 1%, and a stealthy malicious drop rate of 2%.

We see that: (i) even against stealthy dropping attacks with a dropping rate as low as 2%, ShortMAC can successfully localize a faulty link in < 2000 packets with an error rate $\delta < 1\%$, which is orders of magnitude faster than the worst-case theoretical bound (Lemma 14). (ii) In addition, the FN rate is never lower than the FP rate, because when a FP occurs (a benign link being falsely detected) the actual faulty link must have evaded detection for the current epoch (ShortMAC detects only one “faulty” link each epoch). (iii) When $N$ is large, the FP and FN rates are almost identical, because the two rates differ only when no faulty link is detected (the false positive count is 0 while the false negative count is non-zero), which is unlikely to happen when $N$ is large.

Figure 5.5 depicts different detection delays with different natural packet loss rates, demonstrating that a larger $|T_{dr} - p|$ yields higher detection accuracy and lower detection delay.

Figure 5.4: Natural loss. The malicious drop rate is 2%, $T_{dr} = 1\%$, and natural drop rate $p = 0.5\%$.

Figure 5.5: Dropping attacks. The malicious drop rate is 2%, and $T_{dr} = 1\%$.

Figure 5.6: Injection attacks. The malicious injection rate is 2% using 2-bit MACs, natural loss rate $p = 0.5\%$, and $T_{dr} = 1\%$.


Against various injection attacks. Figure 5.6 shows the results when $f_3$ injects packets at a 2% rate (relative to the legitimate packet sending rate). It shows that the error rates fall below 1% within a few hundred packets, indicating that even with 2-bit MACs, an adversary can only inject up to around ten packets without being detected. We further investigate the effects of using different lengths of $k$-bit MACs, and Figure 5.7 shows that the detection delay and error rate diminish dramatically as $k$ increases.


Against combined attacks. Figure 5.8 shows how combinations of dropping and injection attack strategies (in our setting, dropping/injection rates are chosen between 2% and 5%) influence the protocol. We observe that the detection delay is mainly determined by the dropping detection process, which is much slower than the injection detection process. This also indicates that a malicious node cannot gain any advantage (and actually can only harm itself) by injecting bogus packets in an attempt to bias the counter values.

Variance due to different malicious node positions. To investigate the influence of the position of the malicious node, we consider a path with 6 forwarding nodes $f_1, f_2, \ldots, f_6$ and place the malicious node at each position (1 to 6) in turn. We limit the error rate to < 1% and obtain the corresponding detection delays. Figure 5.9 shows one representative scenario where both dropping and injection rates are 5%. We can see that (i) the dropping detection delay increases linearly as the malicious node moves farther away from the source. This is because in the ShortMAC detection process, the source always inspects the closer links first and stops once the first “faulty” link is detected. The FP rate thus increases when more links exist between the source and the malicious node, due to natural packet loss on each link. (ii) In contrast, the injection detection delay exhibits little variance (this cannot be seen from the figure, as the detection delay is determined by the dropping detection), which can also be theoretically proved.

Figure 5.7: Effects of different $k$-bit MAC lengths on detection delay $N$ and false negative rate $\delta$. The malicious injection rate is 2%, $p = 0.5\%$, and $T_{dr} = 1\%$.

Figure 5.8: Combined attacks. “drop p inject q” denotes the use of a p% dropping rate and a q% injection rate at $f_3$.

Figure 5.9: Variance on detection delay $N$ in dropping attacks. $\delta < 1\%$, $T_{dr} = 1\%$, $p = 0.5\%$, and both malicious dropping and injection rates set to 5%.

Comparison with recently proposed protocols. For comparison, we simulate the Full-ACK and PAAI-1 schemes presented in Chapter 4. Recall that Full-ACK is a heavy-weight fault localization protocol requiring an Onion ACK packet from every forwarding node for every packet the source sent. In contrast, PAAI-1 employs packet sampling and only requires acknowledgments for the securely sampled packets to reduce communication overhead while retaining the desired detection delay. Since both Full-ACK and PAAI-1 only consider packet dropping attacks, we compare their dropping detection delays along a path with 6 hops and $f_3$ as the malicious node. Figure 5.10 shows the results when the per-link natural packet loss rate is $p = 0.5\%$ and the drop detection threshold is $T_{dr} = 1\%$. To make the comparison clear, we use a success rate metric, which equals 1 - max{FP rate, FN rate}. The results show that the detection delays needed to achieve a success rate > 99% for ShortMAC, Full-ACK, and PAAI-1 are 2000, 2000, and $5 \times 10^4$ packets, respectively. Table 5.2 shows their detection delays in seconds/minutes and compares the extra communication overhead, based on the results from Figure 5.10 and with $\delta < 1\%$.

Figure 5.10: Comparison with PAAI-1 and Full-ACK. The natural packet loss rate $p = 0.5\%$ and drop detection threshold $T_{dr} = 1\%$.

| | ShortMAC | Full-ACK | PAAI-1 |
|---|---|---|---|
| Detect. delay | 20 sec | 20 sec | 8.3 min |
| Communication | 0.01% | 100% | 5.6% |

Table 5.2: Comparison of ShortMAC, Full-ACK, and PAAI-1 with a source send rate of 100 packets per second.
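
The conversion from detection delay in packets to wall-clock time, and the success rate metric, amount to the following arithmetic (function names are hypothetical):

```python
def success_rate(fp_rate: float, fn_rate: float) -> float:
    """Success rate = 1 - max{FP rate, FN rate}."""
    return 1 - max(fp_rate, fn_rate)


def delay_seconds(packets: float, send_rate_pps: float) -> float:
    """Time needed to send the given number of packets at a fixed rate."""
    return packets / send_rate_pps


if __name__ == "__main__":
    rate = 100                                   # packets per second, as in Table 5.2
    print(delay_seconds(2000, rate), "s")        # ShortMAC and Full-ACK
    print(delay_seconds(5e4, rate) / 60, "min")  # PAAI-1
```

At 100 packets per second, 2000 packets take 20 seconds and $5 \times 10^4$ packets take 500 seconds, or about 8.3 minutes, which is how the delay row of Table 5.2 follows from the packet counts.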

5.7  Linux  Prototype  and  Evaluation 

We implement ShortMAC source and destination nodes as user-space processes running on Ubuntu 10.04 32-bit Desktop OS. Even when implemented in user space on a standard desktop OS, our results show that the cryptographic operations of ShortMAC incur little throughput degradation and negligible additional latency at gigabit line rate. It has also been demonstrated that the speed of PRF functions can be fundamentally improved using modern hardware implementation and acceleration [52].

Implementation details. Our ShortMAC processes listen for application packets via TUN/TAP virtual interfaces and append $k$-bit MACs to the packets. We also implement ShortMAC routers using the Click Modular Router [51] running on Ubuntu 10.04 32-bit Desktop OS, which verify the $k$-bit MACs in each packet at each hop. To approach the realistic performance of commercial-grade routers, we implement the above elements on off-the-shelf servers with an Intel Xeon E5640 CPU (four 2.66 GHz cores with 5.86 GT/s QuickPath Interconnect, 256 KB L1 cache, 1 MB L2 cache, 12 MB L3 cache, and 25.6 GB/s memory bandwidth) and 12 GB of DDR3 RAM. The servers are equipped with Broadcom NetXtreme II BCM5709 Gigabit Ethernet Interface Cards.

Evaluation methodology. We evaluate ShortMAC’s effects on communication throughput and computational overhead, especially due to the generation and verification of $k$-bit MACs using PRF operations. We utilize the widely used Netperf benchmark [4] for the ShortMAC throughput
evaluation,  and  write  our  own  micro-benchmark  for  accurate  latency  evaluation.  We  evaluate 
ShortMAC  with  varying  packet  sizes  by  configuring  the  interface  Maximum  Transmission  Unit 
(MTU)  sizes.  We  evaluate  the  throughput  of  a  ShortMAC  router  and  a  ShortMAC  source  separately 
to  better  illustrate  the  throughput  of  each  component,  while  the  end-to-end  path  throughput  can 
be  easily  derived  by  taking  the  minimum  throughput  of  the  two  evaluation  results.  Then  we 
evaluate  the  end-to-end  latency  with  different  path  lengths  ranging  from  2  to  64.  We  also  exploit 
the  multi-core  parallel  processing  at  the  source  node  via  OpenMP  API  [5]. 

Summary of evaluation results. The evaluation results of our Linux software prototype demonstrate that both a ShortMAC router and source node can retain more than 92% of the baseline throughput (where no ShortMAC operations are employed). Furthermore, the additional latency due to ShortMAC operations is negligible (tens of microseconds) even with a path length of 64 hops. The results further indicate that the ShortMAC scheme is fully scalable as the number of processing cores increases in a software-based implementation, while we anticipate that a hardware implementation of the MAC operations in ShortMAC can further boost the protocol throughput. Details of the evaluation results are as follows.

Router throughput with different PRF implementations. We first evaluate the throughput of a user-level ShortMAC router with different PRF implementations (i.e., UMAC [85], HMAC-SHA1 [53], and AES-CMAC [83]) with the support of the new Intel AES-NI instructions [45]. The ShortMAC router connects a source machine and a destination machine, with the source sending TCP packets via Netperf as fast as possible to the destination to stress-test the router. For comparison, we use the Linux kernel forwarding throughput without ShortMAC operations as the baseline. The ShortMAC router runs as a single user-space process without exploiting parallelism, which already comes close to the baseline speed, as shown below.

Figure 5.11 depicts the results with packet sizes from 100 to 1500 bytes, showing that the UMAC-based PRF implementation yields the highest throughput, retaining more than 90% of the baseline throughput (e.g., 92% with a 1.5 KB packet size and 96% with a 1 KB packet size). With a small packet size of 100 bytes, both the baseline and ShortMAC throughput drop substantially (similar to other public testing results [3]), because the network drivers used in our experiments run in interrupt-driven mode, which hampers throughput when the packet receiving rate is high. However, the UMAC-based PRF still retains 94% of the baseline throughput.

Source node throughput. We further evaluate the throughput of a ShortMAC source node
with different path lengths d, where for each path length the source needs to perform d - 1
UMAC-based PRF operations. At first glance, it might seem that the ShortMAC source node represents
the throughput bottleneck, as the source needs to compute multiple k-bit MACs. However, by
parallelizing the ShortMAC operations on readily available multi-processor systems, the throughput of
a ShortMAC source node can fully cope with the baseline rate even with a path length of 8. For
comparison, we use the source node throughput without ShortMAC operations as the baseline. We
evaluate two different parallelizations based on the widely used OpenMP API [5]. Our first
implementation (internal parallelism in short) uses multiple OpenMP threads to parallelize the
computation of multiple k-bit MACs per packet. Our second implementation (external parallelism
in short) assigns different packets to different OpenMP threads.

Figure 5.12: Source throughput.

We  evaluate  the  ShortMAC  source  throughput  with  various  packet  sizes,  and  observe  that  in 
all  cases  ShortMAC  incurs  negligible  throughput  degradation.  Hence  we  only  show  the  results  with 
packet  size  set  to  1500  bytes  in  Figure  5.12.  We  can  see  that  external  parallelism  yields  the  best 
performance,  which  matches  the  baseline  case  where  the  source  performs  no  ShortMAC  operations. 

ShortMAC latency. We also evaluate the additional latency incurred by a ShortMAC source
node for computing the k-bit MACs with different path lengths and packet sizes; the end-to-end
latency can be derived based on our results. This additional latency in ShortMAC includes
PRF computation, k-bit MAC appending, and TCP/IP checksum updating. We write our own
microbenchmark to derive the additional time delay for the source to send each packet compared to the
baseline case where the source neither computes any k-bit MAC nor updates the checksums.

Figure 5.13 and Table 5.3 show the results. We can see that the latency incurred by the checksum
computation is stable. It does not increase with the packet size because in our implementation
we employ an incremental checksum update for the short MAC appended to the packet, instead of
recomputing the checksum over the entire packet. Nor do we observe a sharp increase in checksum
latency with increasing path length, due to ShortMAC’s efficient k-bit MAC authentication.
In addition, the latency caused by the checksum computation is small compared to the latency
introduced by UMAC-based PRF computation. The additional latency due to UMAC computation
increases linearly with the path length under the same packet size, and also increases linearly with
the packet size under a fixed path length due to the property of the UMAC algorithm. Finally, compared
to the average end-to-end network latency, which is on the order of milliseconds, the additional
latency introduced by ShortMAC is negligible.
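
The incremental checksum update mentioned above can be illustrated with the standard Internet
checksum arithmetic (RFC 1624). The following is a minimal sketch under stated assumptions, not
the prototype's code: the function names are hypothetical, and the short MAC is modeled as a
single 16-bit word written into a previously zeroed slot of the checksummed data.

```python
def ones_complement_sum(data: bytes) -> int:
    """Return the folded 16-bit one's-complement sum of data (zero-padded to even length)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def internet_checksum(data: bytes) -> int:
    """Full recomputation: cost grows with the length of data."""
    return ~ones_complement_sum(data) & 0xFFFF


def incremental_update(old_checksum: int, old_word: int, new_word: int) -> int:
    """RFC 1624 update, HC' = ~(~HC + ~m + m'): constant cost per changed word."""
    total = (~old_checksum & 0xFFFF) + (~old_word & 0xFFFF) + (new_word & 0xFFFF)
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Hypothetical packet whose last 16-bit word is a zeroed slot for a short MAC.
packet = bytearray(bytes(range(256)) * 5 + b"\x00\x00")
old = internet_checksum(bytes(packet))
packet[-2:] = (0xBEEF).to_bytes(2, "big")  # illustrative MAC value, not a real PRF output
updated = incremental_update(old, 0x0000, 0xBEEF)
```

Because the update touches only the changed word, its cost does not depend on the packet size,
which is consistent with the stable checksum latency reported in Table 5.3.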

Figure 5.13: Source latency.

Path Length   Checksum (μs)   UMAC (μs), by packet size (bytes)
                              100       500       1000      1500
2             0.0374          0.1771    0.4760    0.8892    1.4047
3             0.0378          0.3691    0.9557    1.7635    3.3025
4             0.0442          0.5239    1.4273    2.6357    4.0944
5             0.0415          0.7080    1.9018    3.5059    5.4566
6             0.0437          0.8723    2.3758    4.3839    6.8307
7             0.0445          1.0467    2.8530    5.2617    8.2019
8             0.0474          1.2206    3.3274    6.1285    9.5483

Table 5.3: ShortMAC source node latency breakdown (checksum updates and UMAC computation).
All the data represent the average time of processing 50000 packets.

5.8 Discussion and Limitations


Incremental deployment. Although we argue it is feasible to upgrade all routers with ShortMAC
within ISP/enterprise networks, we observe that partial deployment of ShortMAC can still
provide benefits and thus enables incremental deployment. Specifically, the ShortMAC routers form
an overlay network on top of the physical network. In the overlay network, a “logical link” consists
of the physical links between two ShortMAC routers. The fault localization protocol runs only on
the ShortMAC routers, and a data delivery fault will be localized to a logical link. Although in such
settings the source node cannot exactly identify a faulty physical link, it can nevertheless localize
the fault to a network area (a set of links between two ShortMAC routers) to facilitate further
investigation. Furthermore, the more densely the ShortMAC routers are deployed, the more accurate
the fault localization can be, which incentivizes incrementally deploying ShortMAC. However,
one caveat for incremental deployment is that a discovery protocol for determining which routers
support ShortMAC is needed, possibly through the use of explorer packets.

Interdomain deployment. Though ShortMAC mainly targets intra-domain networks such
as ISP and enterprise networks, ShortMAC may also be deployed in interdomain networks such as
the Internet. In the interdomain setting, each Autonomous System (AS) can represent a node in
ShortMAC; the fault localization runs at the AS level and localizes any data delivery fault between
two ASes. To make ShortMAC applicable, different ASes need to establish secret keys (e.g., via
Passport [60]), and the egress router of an AS needs to set the TTL value of each packet to the TTL
value at the ingress router minus one to enable k-bit MAC verification (Section 5.3.1). Finally,
a source AS needs to know the downstream AS path (which is readily available in BGP), which
may dynamically change in the current Internet; however, the majority of AS paths are stable
over minutes [78], thus facilitating ShortMAC fault localization. If an adversary were to constantly
alter paths, it would essentially draw suspicion to itself, since path information is visible and the
adversary needs to remain on the path to remain effective.

Topology changes and short-lived flows. Fault localization protocols inevitably require at
least a threshold number of packets to be sent along the monitored path to obtain a statistically
accurate detection in the presence of natural packet loss. Hence, monitored paths need to be stable
over an epoch. Since ShortMAC incurs several orders of magnitude lower detection delay compared
to related work [97, 21], ShortMAC can support topology or path changes and short-lived flows
much better than previous work. For example, as long as the path remains stable for transmitting
around 2000 packets, the source can make an accurate fault localization. If a path change does
happen during an epoch (e.g., due to a link failure), the source will detect the old link from which
the path was switched away as faulty. At the same time, the source can also learn the routing updates
about the path change, and by correlating the detection results with routing updates, the source
may distinguish a benign path change from a malicious packet misrouting attack (in which case
no corresponding routing updates will be received). However, the fault localization accuracy of
ShortMAC decreases for dynamic paths that transmit far fewer than 2000 packets before path
changes occur.

Multipath routing. A ShortMAC router maintains different counters for different paths, and
needs to know which counter to update given a certain packet (i.e., which path the packet belongs to).
If a source uses multiple paths simultaneously to reach a destination, the source and destination IDs
alone are no longer sufficient to identify a path. Instead, the source needs to encode the path in the
packets so that the routers know which counters to update. For example, in SCION routing [96],
the source embeds the path into packet headers, which naturally supports ShortMAC.
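
The need for a path identifier under multipath routing can be sketched as follows. This is an
illustrative model only: the class `PathCounters`, its methods, and the representation of a path
as a tuple of node names are assumptions made for this example, not part of the ShortMAC
specification.

```python
from collections import Counter


class PathCounters:
    """Hypothetical per-path packet counters kept by one router."""

    def __init__(self):
        self._by_path = Counter()      # keyed by the full path carried in the packet
        self._by_endpoints = Counter() # keyed by (source, destination) only

    def record(self, path):
        """Account for one packet that carries the given path in its header."""
        path = tuple(path)
        self._by_path[path] += 1
        self._by_endpoints[(path[0], path[-1])] += 1

    def count_for_path(self, path):
        return self._by_path[tuple(path)]

    def count_for_endpoints(self, source, destination):
        return self._by_endpoints[(source, destination)]


# Router R sits on two simultaneous paths from S to D.
counters = PathCounters()
for _ in range(3):
    counters.record(["S", "R", "X", "D"])
counters.record(["S", "R", "Y", "D"])
```

Keyed by source and destination alone, R sees a single merged count of 4 and cannot tell which
path a given packet belongs to; keyed by the embedded path, it keeps the separate counts (3 and 1)
that per-path fault localization relies on.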


5.9  Summary 

In this chapter, we design, analyze, implement, and evaluate ShortMAC, an efficient path-based
fault localization protocol, which enables a theoretically proven guarantee on data-plane packet
delivery and substantially outperforms related protocols in the following aspects. First, ShortMAC
achieves high security assurance even in the presence of strong adversaries in control of colluding
malicious routers that can drop, modify, inject, and misroute packets on the forwarding paths,
whereas a majority of existing fault localization protocols exhibit security vulnerabilities under
such a strong adversary model. Second, compared to existing secure protocols, ShortMAC achieves
several orders of magnitude lower detection delay and protocol overhead, which facilitates its
practical deployment. Finally, we demonstrate that ShortMAC’s efficient cryptographic operations, even
if implemented in software, have negligible effects on the communication throughput via realistic
testing on Gigabit Ethernet links. We anticipate that ShortMAC’s probabilistic authentication and
efficient fault localization can become basic building blocks for the construction of highly secure
and efficient network protocols.

The high efficiency of ShortMAC facilitates its practical deployment and enables the
construction of efficient secure routing protocols. Though more efficient than PAAI, ShortMAC
requires changes to the packet headers (for adding the k-bit MACs), while PAAI requires no
changes to the packet headers. In addition, as a path-based protocol, ShortMAC still suffers from
several limitations, as discussed and addressed in the next chapter.

Though PAAI and ShortMAC strive to optimize the efficiency of fault localization, theoretically
proven lower bounds have shown that path-based fault localization protocols in the current network
infrastructure inevitably incur prohibitive overhead. We observe the current limits are due to a lack
of trust relationships among network nodes. This chapter demonstrates that we can achieve much
higher fault localization efficiency by leveraging trusted computing technology to design a
1-hop-based fault localization protocol, TrueNet, with a small Trusted Computing Base (TCB). We also
intend TrueNet to serve as a case study that demonstrates trusted computing’s ability to yield
tangible and measurable benefits for secure network protocol designs.

6.1  Introduction 

Barak et al. recently proved a lower bound on the overhead of path-based fault localization protocols in
the current network infrastructure [21], which is impractical for large-scale ISP/enterprise/datacenter
networks. Specifically, the lower bound states that a router must share some secret (e.g.,
cryptographic keys) with each source sending traffic traversing that router, making the key storage
overhead at an intermediate router linear in the number of end nodes. In addition, path-based
fault localization protocols run at the granularity of entire end-to-end paths, requiring each
intermediate router to store per-path state and the paths to be long-lived (e.g., transmitting at least 10^6
packets, which would hinder agile load-balancing and traffic engineering) [21, 97]. These
fundamental limitations exist in traditional network infrastructure due to the lack of any trust relationships
among nodes. Hence, a source node needs to directly check or monitor all intermediate routers
(thus sharing secret keys and state) in the routing path to ensure the routers behave correctly.

Furthermore, in existing secure fault localization protocols, a node n which detects a faulty link
l can only remove l from n’s local routing table but cannot share the detection result with other
nodes; otherwise, a potentially malicious n could make false accusations against benign links (slander
attacks). This retards the network-wide detection/failure recovery process, and causes inconsistent
routing tables at different nodes (faulty links excluded from the routing tables of some but not all
nodes). Inconsistent routing tables violate the requirements of certain routing protocols such as
link-state routing. The lack of trust among network nodes thus inhibits the global sharing of local
detection results.

In light of the fault localization limitations in current network infrastructures, we explore how
trusted computing technology can enable a network architecture with intrinsic trust of correct data
delivery among nodes, with fundamentally better performance than the proven bounds [21] in
a traditional network architecture. Our key insight is that remote code attestation provided by
trusted computing enables a node to verify whether a remote communicating node runs a trusted (or
expected) version of software/protocol via authenticated “code measurements”. Isolation further
ensures that critical code execution and data are isolated from all other code and devices on the
local system. Jointly, these properties provide transitivity of verification: if A verifies B’s
code integrity (via attestation and isolation) and B verifies C, then A believes in C’s code integrity
as well without needing to verify C’s code integrity directly, because A knows B’s code has correctly
verified C. Transitivity of verification, when applied to secure network protocol designs, enables each
node to perform verification and monitoring only with its 1-hop neighbors, building a chain of
verification over the end-to-end path with reduced overhead, i.e., only requiring per-neighbor (as
opposed to per-node or per-path) state at each router. In short, transitivity of verification eliminates
the need to establish direct point-to-point validation between any two nodes in the network, which incurs
high storage overhead and obstructs key management.
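
The chain-of-verification argument can be made concrete with a small model. The sketch below is
purely illustrative: the function names and the representation of a successful attestation as an
ordered pair (verifier, verified) are assumptions for this example, and no actual attestation is
performed.

```python
def end_to_end_verified(path, verified_pairs):
    """Return True iff every node on the path has verified its next-hop neighbor.

    path is an ordered list of node names; verified_pairs is a set of
    (verifier, verified) tuples, each standing for one successful 1-hop attestation.
    """
    return all((u, v) in verified_pairs for u, v in zip(path, path[1:]))


def checks_per_node_direct(path):
    """Checks the source performs if it verifies every downstream node itself."""
    return len(path) - 1


def checks_per_node_transitive(path):
    """Checks any single node performs along this path with 1-hop verification."""
    return 1 if len(path) > 1 else 0


path = ["A", "B", "C", "D"]
attestations = {("A", "B"), ("B", "C"), ("C", "D")}
chain_ok = end_to_end_verified(path, attestations)
broken_ok = end_to_end_verified(path, attestations - {("B", "C")})
```

With the full set of 1-hop attestations the chain holds and A never interacts with C or D
directly; removing a single hop's attestation breaks the chain at exactly that hop.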

Though useful, current trusted computing technologies are by no means a panacea when directly
applied to the realm of computer networks. Although several researchers propose Trusted Platform
Module (TPM)-based protocols for securing general distributed systems (e.g., BIND [82]) and
specific network applications (e.g., Not-a-Bot [40]), fundamental challenges render these approaches
ineffective in securing data delivery at the network layer: (i) existing approaches cannot “attest”
raw command-line configuration for which an expected “measurement” for remote attestation is
hard to define, (ii) the extensive network stack would swell the size of the Trusted Computing Base
(TCB) and it is challenging to abstract a small-sized, invariant “critical code”, and (iii) a large ISP
network can contain different routing instances with different implementation versions [57], which
obstructs the use of a consistent “code measurement” for attestation.

The TrueNet design answers these challenges of applying trusted computing. Instead of strictly
attesting the semantics of the huge, intertwined network stack itself, TrueNet attests the behavior
of the network stack, i.e., whether it has successfully delivered the data or not. On one hand,
the success of data delivery guarantees that all of the network-layer components have worked
correctly, regardless of their implementation variations. On the other hand, if any of the
network-layer components misbehaves, failures will arise in data delivery by which the faulty link(s) can be
detected. Correspondingly, our approach in TrueNet is to monitor 1-hop data delivery behavior
(behavior of the network-layer protocol stack) with a small monitoring module as the critical code
at each hop, and attest, isolate, and protect only the particular monitoring module with trusted
computing. Thus, TrueNet requires only a small amount of critical code (the small monitoring
module) as the TCB. Such a small TCB size (i) supports different network stack implementations
and flexible protocol updates, (ii) makes the attestation of the small critical code efficient, and
(iii) enables applying formal analysis [28] on the small critical code to ensure the TCB is indeed
trustworthy.

The small TCB on each TrueNet router forms a logical protected path overlaid on the physical
machines and an untrusted network stack between a source and destination, along which data
delivery is monitored and ensured. As a result, TrueNet achieves efficient fault localization with
small router state (only per-neighbor state), support for dynamic/short-lived paths (no
requirements on the minimum number of packets transmitted along a path since monitoring is
performed only between neighbors), and global sharing of detection results while eliminating
slander attacks. As a proof of concept, we implement a TrueNet prototype in Linux using existing
trusted computing technology and a TPM, and demonstrate that TrueNet provides high throughput
while achieving the desired security properties. We also conduct measurements based on real traces to
show that the router state in TrueNet is up to five orders of magnitude less than related work [21, 97].

Contributions. We design, implement, and evaluate TrueNet, which, assuming trusted hardware,
achieves secure fault localization with properties (i.e., per-neighbor router state, dynamic
path support, and global sharing of fault localization results while avoiding slander or framing
attacks) that overcome the performance bounds previously proven for traditional networks [21].
TrueNet still provides benefits under partial adoption, enabling incremental deployment, and can be
deployed in inter-domain settings with the recently proposed SCION architecture [96]. Finally,
TrueNet explores the role trusted computing might play in securing network protocols, shows the
possibility of using trusted computing to break traditional performance boundaries, and could spark
future research.

6.2  Setting 

Besides  the  problem  formulation  described  in  Chapter  2,  we  introduce  additional  assumptions  and 
definitions  for  this  chapter  below. 

Definition 17. We denote by $\delta_{AB} = \{\delta^{dr}_{AB}, \delta^{in}_{AB}\}$ the number of
original packets dropped and misrouted ($\delta^{dr}_{AB}$), and the number of packets injected,
modified, and reordered ($\delta^{in}_{AB}$) on link $l_{AB}$. A link $l_{AB}$ is faulty if
$\delta_{AB}$ is larger than a certain accusation threshold $\{T_{dr}, T_{in}\}$ set by the network
administrator, i.e.:

$$\delta^{dr}_{AB} > T_{dr}, \quad \text{or} \quad \delta^{in}_{AB} > T_{in}. \tag{6.1}$$
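
As a minimal sketch of the faulty-link test in Definition 17 and Equation (6.1): the type and
function names below are hypothetical, and the threshold values in the usage lines are arbitrary
illustrations rather than values recommended by this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkFaultCounts:
    """Per-link fault counts for one link, following Definition 17."""
    dropped_or_misrouted: int            # original packets dropped or misrouted
    injected_modified_or_reordered: int  # packets injected, modified, or reordered


def is_faulty(counts: LinkFaultCounts, t_dr: int, t_in: int) -> bool:
    """Apply Equation (6.1): the link is faulty if either count exceeds its threshold."""
    return (counts.dropped_or_misrouted > t_dr
            or counts.injected_modified_or_reordered > t_in)


# Arbitrary illustrative thresholds chosen by a hypothetical administrator.
T_DR, T_IN = 10, 5
benign = is_faulty(LinkFaultCounts(3, 0), T_DR, T_IN)
lossy = is_faulty(LinkFaultCounts(11, 0), T_DR, T_IN)
forging = is_faulty(LinkFaultCounts(0, 6), T_DR, T_IN)
```

Note that the comparison is strict, so a count exactly equal to its threshold does not mark the
link as faulty.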

Definition 18. Aggregate fault localization is achieved iff, given a routing path p, $\delta_{AB}$ can be
accurately learned for each link $l_{AB}$ in p. Per-packet fault localization is achieved iff, given the
routing path p, the failure of delivering a single packet in p can be immediately localized to a specific
link in p.

Adversary Model. We follow the trusted computing literature and assume the adversary can
compromise the router OS, install malware on the routers, and launch remote software-based
attacks; but the adversary cannot compromise hardware or manipulate the physical network
infrastructure, nor defeat trusted computing primitives (code attestation and isolation). Such a remote
attacker model is consistent with real-world router-based attacks. For example, most documented
router compromises in ISP and enterprise networks are due to phishing [7] and to remote exploitation
of router software vulnerabilities [2, 13] and weak passwords [41] by remote hackers [87]. In addition,
a majority of network operators in a recent security survey [1] listed router misconfiguration, which
also falls under our software-based attack model, as an important cause of outages; and documented
router software misconfiguration has led to network partitioning [55]. Finally, software-based
attacks are usually stealthier and larger in scale than hardware-based attacks, since a hardware-based
attacker usually needs physical proximity to targeted routers and will likely leave physical evidence,
making the attack more auditable and less scalable.

The adversary controls multiple malicious routers which can drop, modify, inject, reorder, and
misroute packets on links incident to the malicious nodes under its control. Furthermore, the adversary
can launch collusion attacks where multiple malicious routers coordinate and conspire to evade
fault localization or incriminate a benign link. However, the adversary has polynomially bounded
computational power and cannot break cryptographic primitives.


6.3  Fundamental  Challenges 

We  further  elaborate  on  the  fundamental  challenges  in  directly  applying  code  attestation  and 
isolation  to  secure  data  delivery  in  large-scale  networks. 

Large protocol stack. The network layer contains numerous interacting software components,
e.g., (i) topology discovery, (ii) path selection from the topology, (iii) converting routing tables to
forwarding tables, (iv) forwarding table lookup, etc. The incorrect operation of any of these
components will hamper the correctness of the eventual network data delivery; therefore, straightforward
attestation of the entire protocol stack would require attesting tens of thousands of lines of code.
For example, the IPv4 subsystem in the Linux 2.6.37 kernel contains more than 66K lines of code,
and the IP-related elements in the Click modular router [51] contain more than 15K lines of code.
This swells the TCB size and thus broadens the surface for potential vulnerabilities.

Diverse implementations and complex dependencies. In practice, there can be many
coexisting protocol implementations and instances [57] within the same large ISP or enterprise
network. Furthermore, due to the intrinsic and obscure interactions between network-layer
components, it is highly challenging to distill an invariant, small, infrequently updated critical code as
TCB to be attested.

Securing raw user input/configuration. In addition to the network protocol stack, data
delivery also depends on human command-line input and configuration. Unfortunately, user
configurations are hard to attest due to the flexibility of the configuration language, yet attackers can
exploit them to sabotage data delivery. Since the current Cisco IOS provides
rich command-line interfaces to drop and alter packets, an attacker can cause damage without even
modifying the network stack.

Hence in this chapter, we strive to address these challenges by ascertaining the minimal, invariant
critical code for securing network data-plane packet delivery, along with its minimal configuration
parameters.

6.4  Design  Building  Blocks 

Remote attestation, isolation, and sealed storage are the high-level primitives that trusted
computing offers pertaining to our purpose of securing network data delivery.

Trusted computing primitives. By remotely attesting a selected piece of “critical code”, a
node X can verify if a remote node Y is executing the expected, correct version of the critical
code. In conjunction with isolation, attestation can ensure that the execution of the critical code
occurs untampered by any potentially present malicious code including the OS. Specifically, with
attestation of the 1-hop monitoring module as the critical code in TrueNet, a node X can convince
another node Y that X is indeed executing the correct monitoring module in an isolated fashion.
Furthermore, sealed storage binds a piece of sensitive data to a particular piece of software, ensuring
that only the software that originally sealed the data accesses it. In TrueNet, sealed storage can
seal a monitoring module’s secret keys so that only the same monitoring module can access the
secrets.

These trusted computing primitives have been widely deployed on commodity computers [39,
44]. In the remainder of the chapter, we first use these trusted computing primitives conceptually for
presenting the TrueNet protocol. Then we delineate and implement a TrueNet router architecture
incorporating the trusted computing primitives in Sections 6.10 and 6.11.


Security properties. Remote attestation and sealed storage can be used to set up secure
channels and transitivity of monitoring results as the security properties leveraged by TrueNet for
efficiently achieving fault localization.

1) Secure channel: The above trusted computing primitives enable a monitoring module $MM_A$ to
generate and convey its public key to a remote $MM_B$ [70], based on which $MM_A$ and $MM_B$ can
establish a shared secret key. By performing cryptographic operations using the secret keys sealed
and only known by the trusted monitoring modules at network routers, a compromised router
OS or malware cannot impersonate the monitoring module by forging signatures or performing
encryption/decryption based on those sealed keys. This builds a secure communication channel
among the monitoring modules at different routers.

2) Transitivity of monitoring results: End-to-end monitoring can now be achieved via a chain of
1-hop monitoring between every two adjacent neighbors while eliminating slander and collusion
attacks. This is because if a node X verifies via code attestation that its neighbor Y is executing
the correct monitoring module $MM_Y$, X knows that the monitoring results reported by $MM_Y$ are
correct, and that $MM_Y$ is correctly monitoring Y’s neighbor, which recursively ensures the entire
end-to-end path is being correctly monitored.

Figure 6.1: An example topology to illustrate the operation of TrueNet. The solid line represents
the logical protected path of packets implemented by the secure channels between the trusted
monitoring modules.


6.5  TrueNet  Overview 


We give an overview of TrueNet with Figure 6.1 as an example topology. The shaded areas denote
the monitoring modules, which are isolated and protected by trusted computing at each router and
thus reside in the TCB. A router’s network stack (including the OS, network interfaces, and other
related programs) is untrusted.


The logical protected path. In TrueNet, each packet is supposed to pass through the monitoring
module $MM_i$ at each hop i. The MMs on the logical path are protected by trusted computing
mechanisms and are thus trusted. The dashed line in Figure 6.1 depicts the actual packet path
comprising the physical machines and network stack, originating from node S and destined for D.
In contrast, the secure channels between adjacent trusted monitoring modules along the actual
packet path form a logical protected path overlaid on the untrusted network stack. Every two
neighboring $MM_A$ and $MM_B$ on the logical protected path share a secret key $K_{AB}$ that is sealed
by and only accessible to the same $MM_A$ or $MM_B$. Nodes (i.e., monitoring modules) in the logical
protected path can thus communicate with secrecy and authenticity using the shared and sealed
secret keys, and the untrusted network stack cannot inject or forge authenticated messages in the
logical protected path. Nodes in the logical protected path can also attest to each other that the
MMs are indeed intact and trusted.

The formation of this logical protected path requires only per-neighbor key storage yet greatly
facilitates secure fault localization. Specifically, each $MM_i$ maintains a local data structure (e.g., a
counter) to reflect the reception of each packet as the “packet footprint”. In this way, each packet
should leave a certain footprint at each hop’s monitoring module iff the packet is successfully
delivered along the logical protected path. Later, comparing the packet footprints left at every
two neighbors $MM_A$ and $MM_B$ in the logical protected path either confirms that the packets
have been successfully delivered (if the footprints match) or reveals that some problem occurred
between $MM_A$ and $MM_B$ (if the footprints do not match). The secrecy and authenticity properties
of the logical protected path ensure that the footprints reported by each $MM_i$ will not be forged
or injected by a malicious network stack or malware.

Localizing a faulty link. Note that TrueNet detects a faulty link between two adjacent MMs,
rather than a specific malicious router. In this way, MMs do not rely on the untrusted network
stack or NIC to correctly deliver packets to the MMs: if the NIC or network stack of a router
M drops or modifies packets before sending them to MM_M, faults will be localized between MM_M
and its neighboring MMs. For example, in Figure 6.1, if the malicious OS or malware in router A
corrupts or drops the packet before it reaches MM_A, then the footprint that the packet leaves at
MM_A will differ from that at MM_S, thus causing link l_SA to be detected, as we show shortly.

Small TCB. The TCB in TrueNet includes only the trusted computing primitives and the protected
MM. Due to the challenges outlined in Section 6.3, it is impractical to include the entire
network stack and NIC in the TCB or to subject them to code attestation. Consequently, one cannot
simply use attestation to determine whether the local OS or NIC is compromised and stop any
malicious system.

TrueNet fault localization phases. At a high level, TrueNet consists of setup, 1-hop monitoring,
and global accusation phases, as sketched below.

1) Setup: During protocol setup, an administration entity of the network installs a public/private
key pair, the public key K_admin of the administration entity, and a neighbor list on each node.
Every two neighbors A and B establish a shared secret key K_AB, which is used to authenticate the
messages exchanged between MM_A and MM_B in the logical protected path. The administration
entity signs the neighbor list along with a version number using its private key K_admin^{-1}. The
node private key, MAC key and K_AB are sealed by and only accessible to the local monitoring module.

2) 1-hop monitoring: To implement the secure channel between neighboring MMs in the logical
path, MM_A computes a Message Authentication Code (MAC) for each packet sent to the next-hop
MM_B in the logical protected path using K_AB. By verifying the MAC, MM_B can be convinced
that its neighbor A is running the correct monitoring module; otherwise K_AB could not have been
retrieved for authentic MAC generation. Similarly, by authenticating the footprint reports, a node
can be convinced that its neighbors are reporting the correct footprints and have correctly
monitored their own neighbors in the logical protected path; otherwise the sealed key could not
have been retrieved for authenticating the reports. This chain of 1-hop monitoring ensures that
all links in a logical protected path have been correctly monitored.

TrueNet provides two types of 1-hop monitoring primitives in the monitoring modules, namely,
per-packet monitoring and aggregate monitoring, for achieving per-packet fault localization and
aggregate fault localization, respectively. These two monitoring approaches differ in the footprint
data structure and in how frequently footprints are compared between neighbors. In per-packet
monitoring, a monitoring module MM_B maintains an identifier (e.g., a sequence number) for each
received packet with a correct MAC computed by MM_A, and immediately sends back an
acknowledgment (ACK) to MM_A for each such packet. In aggregate monitoring, in contrast, MM_B
increments a counter if a packet received from the neighbor MM_A contains a correct MAC computed
by MM_A. MM_B then periodically exchanges the counters with its neighbor across the logical
protected path. Hence, aggregate monitoring reduces the communication overhead and tells how
many packets have been dropped or corrupted between every two neighbors in the logical protected
path, while per-packet monitoring provides more fine-grained and immediate information about
which packets have been corrupted between two neighbors, enabling instant failure recovery (e.g.,
by immediately retransmitting the corrupted packets at the network layer on a per-link basis). In
both monitoring approaches, MMs add per-neighbor sequence numbers to the data packets, which are
used to prevent replay and reordering attacks and to identify dropped packets.

3) Global accusation: A monitoring module MM_A constantly asks for the footprint reports from
each neighbor MM_B to learn δ_AB. If MM_A observes an abnormally large δ_AB on a link l_AB
in the logical protected path, MM_A sends out an accusation message to its 1-hop neighbors in
the logical protected path, which can verify and accept the message based on authentic MACs.
Similarly, the neighbors of MM_A in the logical protected path further tell their neighbors about
the accusation. This process recursively achieves network-wide trustworthy broadcasting
(Section 6.8). Hence, all the network nodes remove faulty links from their routing tables upon
identification. Such consistency of routing tables further accelerates network-wide failure
recovery and enables the use of link-state routing, which remains the de facto routing protocol
for contemporary intra-domain networks.

Small router state and support for dynamic paths. Note that in any phase, attestation
and authentication are only performed between two neighbors; thus each node only maintains
per-neighbor state. Such 1-hop operations also eliminate the need for long-lived and stable paths,
facilitating load balancing.

The  following  sections  detail  each  phase  of  TrueNet. 


6.6  TrueNet  Setup 

In  the  setup  phase,  a  local  network  administrator  remains  responsible  for  setting  up  and  updating 
a  router  with  appropriate  cryptographic  keys  and  its  neighbor  list  as  follows. 

Day Zero setup. The first time a router i physically joins a network, the network administrator
performs the following steps. (i) The administrator launches a monitoring module MM_i on router i
and ensures that MM_i is securely loaded and protected by the trusted computing primitives on
router i. (ii) The administrator installs the public key K_admin of the administration entity of
the network into MM_i and ensures that MM_i has correctly loaded and protected K_admin for
verifying future messages from the administrator. (iii) The administrator creates and installs a
public/private key pair K_i/K_i^{-1} and a neighbor list NL_i for router i, along with a version
number and a signature created using its private key K_admin^{-1}. The private key K_i^{-1} is
sealed and only accessible to MM_i. (iv) Each router i exchanges a secret key K_ij with each of
its neighbors j using their public/private key pairs [70]. K_ij is sealed and only accessible to
MM_i and MM_j, and is used for constructing the secure channel between MM_i and MM_j.

Step  Node      Operation          Content
S1)   Source    OS_S -> MM_S       packet m
S2)   Source    MM_S genPkt        M_S <- m, N_SA, MAC_{K_SA}(m || S || N_SA)
S3)   Source    MM_S awaitACK      store N_SA, start timer
S4)   Source    MM_S incrSN        N_SA <- N_SA + 1
S5)   Source    MM_S -> OS_S       M_S  => sent to router A
A1)   Router A  OS_A -> MM_A       M_S
A2)   Router A  MM_A validatePkt   if M_S invalid, accuse l_SA
A3)   Router A  MM_A genACK        ack_AS <- A, N_SA, MAC_{K_SA}(A || N_SA)
A4)   Router A  MM_A incrSN        N_SA <- N_SA + 1  (A's local copy)
A5)   Router A  MM_A -> OS_A       ack_AS  => sent back to source S
S6)   Source    OS_S -> MM_S       ack_AS
S7)   Source    MM_S verifyACK     if ack_AS invalid, accuse l_SA
A6)   Router A  MM_A updatePkt     M_A <- m, N_AB, MAC_{K_AB}(m || A || N_AB)
A7)   Router A  MM_A awaitACK      store N_AB, start timer
A8)   Router A  MM_A incrSN        N_AB <- N_AB + 1
A9)   Router A  MM_A -> OS_A       M_A  => further sent to router B

Table 6.1: TrueNet per-packet monitoring. Shaded instructions are functions of the monitoring
module MM, which is in the TCB. MAC_K(m) denotes a Message Authentication Code (MAC) computed
over m using the symmetric key K.

Incremental updates. After Day Zero setup, the administration entity signs all its update
messages to the routers (e.g., when updating NL_i or K_i) with its private key K_admin^{-1}.
These control messages from the administration entity are protected by per-packet monitoring,
as we describe below. The MMs running at the routers are responsible for verifying the
authenticity of these update messages using the public key K_admin. Neighboring nodes i and j
can also periodically update their shared secret key K_ij. We omit the details of handling these
updates due to space limitations.

6.7  TrueNet  1-Hop  Monitoring 

Given an end-to-end communication path p, 1-hop monitoring in TrueNet ensures that the data
sent by the source is correctly delivered to the destination along p; otherwise, a faulty link in
p that tampers with correct data delivery is localized and accused. We therefore assume that the
source node can learn path p (e.g., from link-state routing, source routing, or recent centralized
routing protocols like 4D [37], SANE [27] or ETHANE [26]), which is a common requirement for all
existing secure fault localization schemes. We first detail per-packet and aggregate monitoring,
and then discuss their usage scenarios in Section 6.7.3.

6.7.1  Per-packet  Monitoring 

We  use  Figure  6.1  as  an  example  to  illustrate  TrueNet  per-packet  monitoring.  Table  6.1  shows 
the  interactions  between  the  source  S  and  the  first  hop  router  A  for  transmitting  and  protecting  a 
single  packet.  Subsequent  routers  in  path  p  will  perform  identical  operations  as  router  A. 

Packet generation. Upon receiving a packet m with path p embedded from the network stack
(OS_S) of the source S, the trusted monitoring module MM_S wraps the packet into M_S with a
per-neighbor sequence number N_SA for the next-hop router A, and a MAC computed over m, the
sender S, and N_SA with the secret key K_SA shared between MM_S and MM_A (Table 6.1 S2).
Meanwhile, router A maintains a per-link sequence number N_SA remembering the last sequence
number for the packets sent from S to A. Note that only one MAC for the next hop is attached (as
opposed to attaching one MAC for each router in the path), because the transitivity of
verification provided by trusted computing enables the chaining of trusted 1-hop verifications
to achieve end-to-end guarantees.

As it transmits the packet, MM_S starts a timer, expecting to receive an ACK from the next-hop
receiver MM_A within the allocated time, which allows MM_S to determine whether MM_A
successfully received the packet. For this purpose, N_SA is temporarily stored as the packet
identifier until the timer expires (Table 6.1 S3). MM_S then increments N_SA for the next packet
to be sent, to prevent packet replay and reordering attacks (Table 6.1 S4), and sends M_S back to
OS_S, which in turn forwards M_S to router A.
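
Steps S2 to S5 of Table 6.1 can be sketched as follows. This is only an illustrative sketch: the
choice of HMAC-SHA256 as the MAC, the length-prefixed encoding, and the names WrappedPacket,
compute_mac, and SenderMM are assumptions made here, and key sealing, the ACK timer, and the
hand-off to the OS are omitted.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class WrappedPacket:
    payload: bytes  # original packet m, with path p embedded
    seq: int        # per-neighbor sequence number N_SA
    mac: bytes      # MAC over (m, S, N_SA) under K_SA


def compute_mac(key: bytes, payload: bytes, sender: bytes, seq: int) -> bytes:
    # Length-prefix the payload so the concatenation m || S || N_SA is unambiguous.
    msg = len(payload).to_bytes(4, "big") + payload + sender + seq.to_bytes(8, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()


class SenderMM:
    """Toy stand-in for MM_S (hypothetical helper, not the TrueNet implementation)."""

    def __init__(self, node_id: bytes, key_to_next_hop: bytes):
        self.node_id = node_id
        self.key = key_to_next_hop   # K_SA; sealed storage is not modeled
        self.next_seq = 0            # N_SA
        self.awaiting_ack = set()    # identifiers kept until the ACK arrives

    def gen_pkt(self, payload: bytes) -> WrappedPacket:
        seq = self.next_seq
        mac = compute_mac(self.key, payload, self.node_id, seq)
        pkt = WrappedPacket(payload, seq, mac)   # S2: wrap m into M_S
        self.awaiting_ack.add(seq)               # S3: remember N_SA (timer omitted)
        self.next_seq += 1                       # S4: increment N_SA
        return pkt                               # S5: hand M_S back to the OS
```

Because the sequence number is covered by the MAC, two packets with the same payload still carry
different MACs, which is what lets the receiver reject replays.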

Packet reception. Each received packet is expected to be passed through the monitoring module
at each hop. At router A, MM_A first validates the received packet M_S via validatePkt (Table 6.1
A2), which checks the sequence number, the next hop, and the MAC as follows:

1) validatePkt first checks whether the per-neighbor sequence number N_SA contained in M_S matches
the locally stored per-neighbor N_SA value. If the values differ, indicating a replay, re-ordering,
or packet injection, validatePkt terminates (skipping the following checks) and returns “invalid”.

2) validatePkt then retrieves the next hop from path p embedded in M_S, and checks whether the
local router A is indeed the next hop in p for the current communication flow. An inconsistency
indicates that the previous router’s OS used a wrong interface (the packet was misrouted), and
validatePkt terminates, returning “invalid”.

3) validatePkt finally checks the MAC in M_S, and returns “invalid” if the MAC is incorrect.
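
The three checks can be sketched as a single function. This is a minimal sketch under stated
assumptions: the path is modeled as a loop-free list of node identifiers, the MAC is HMAC-SHA256
over a length-prefixed encoding, and the names mac_over and validate_pkt are hypothetical.

```python
import hashlib
import hmac


def mac_over(key: bytes, payload: bytes, sender: bytes, seq: int) -> bytes:
    # Assumed encoding of m || S || N_SA; the payload is length-prefixed.
    msg = len(payload).to_bytes(4, "big") + payload + sender + seq.to_bytes(8, "big")
    return hmac.new(key, msg, hashlib.sha256).digest()


def validate_pkt(local_id, expected_seq, key, sender, payload, path, seq, mac):
    """Return "valid" or "invalid", applying the three checks in order."""
    # 1) The carried sequence number must match the locally stored one.
    if seq != expected_seq:
        return "invalid"
    # 2) The local router must be the hop that follows the sender in path p.
    if sender not in path:
        return "invalid"
    idx = path.index(sender)
    if idx + 1 >= len(path) or path[idx + 1] != local_id:
        return "invalid"
    # 3) The MAC must verify under the key shared with the sender.
    if not hmac.compare_digest(mac, mac_over(key, payload, sender, seq)):
        return "invalid"
    return "valid"
```

The cheap comparisons run first, so a replayed or misrouted packet is rejected before any MAC
computation is spent on it.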

If validatePkt outputs “valid”, MM_A generates an ACK including N_SA as the packet identifier
with a MAC (Table 6.1 A3), which MM_S awaits. MM_A then increments the local per-neighbor
sequence number N_SA (Table 6.1 A4) to prevent packet replay and reordering attacks. If
validatePkt returns “invalid”, MM_A concludes that forwarding misbehavior occurred between MM_S
and MM_A (i.e., on link l_SA). MM_A then either generates an accusation using efficient
trustworthy broadcasting (Section 6.8) if the failure rate remains high, or signals MM_S in the
ACK for instant failure recovery, as we show shortly.

Packet forwarding. If the packet validation succeeds, the original MAC embedded in the received
packet M_S is replaced with a new one computed for the next hop MM_B using the sealed secret key
K_AB shared between MM_A and MM_B; the per-neighbor sequence number is likewise replaced with
the one (N_AB) for traffic between MM_A and MM_B (Table 6.1 A6). Right before the updated packet
M_A departs MM_A, MM_A also starts a timer and expects an authenticated ACK from the next-hop
MM_B (Table 6.1 A7). Finally, MM_A increments the per-neighbor sequence number N_AB for the
next hop B to prevent packet replay and reordering attacks (Table 6.1 A8).

ACK reception and failure recovery. Upon receiving an ACK ack_AS from a neighbor router A
(Table 6.1 S6), MM_S checks whether the corresponding packet identifier (N_SA in this case) is
still stored, indicating that the timer has not expired. Then MM_S checks whether the MAC is
correct. If either check fails, MM_S can either re-transmit the particular corrupted packet up to
r times for instant failure recovery, or globally accuse l_SA via trustworthy broadcasting for
failing to deliver any of the r + 1 packets corresponding to N_SA. The number of re-transmissions
r is set to tolerate spontaneous packet loss. For example, assuming an upper bound p on the packet
loss rate (as a probability) and an upper bound ε on the allowed false positive rate, we should
set r ≥ ln ε / ln p − 1.
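
As a worked illustration of this bound, the following sketch computes the smallest admissible r.
The function name min_retransmissions and the numeric values are assumptions chosen for the
example, not values taken from the evaluation.

```python
import math


def min_retransmissions(p: float, eps: float) -> int:
    """Smallest integer r >= 0 such that p ** (r + 1) <= eps.

    p   : upper bound on the natural per-packet loss probability (0 < p < 1)
    eps : upper bound on the allowed false positive rate (0 < eps < 1)
    """
    if not (0.0 < p < 1.0 and 0.0 < eps < 1.0):
        raise ValueError("p and eps must lie strictly between 0 and 1")
    # p ** (r + 1) <= eps  <=>  r >= ln(eps) / ln(p) - 1, since ln(p) < 0.
    return max(0, math.ceil(math.log(eps) / math.log(p) - 1))
```

For instance, with p = 0.05 and ε = 10^-4 the bound gives r = 3: four consecutive natural losses
have probability 0.05^4 = 6.25 × 10^-6, whereas three have probability 1.25 × 10^-4, which would
exceed ε.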


Optimization.  Similar  to  the  TCP  acknowledgment  mechanism,  a  sender  MM  can  send  data 
packets  asynchronously  to  the  ACKs  within  a  certain  sliding  window  of  w  packets,  before  the 
ACKs  for  previous  packets  have  been  received.  Accordingly,  a  receiver  node  can  send  one  single 
ACK  for  all  the  w  packets  in  the  previous  sliding  window  to  reduce  communication  overhead. 

6.7.2  Aggregate  Monitoring 

In  aggregate  monitoring,  packet  forwarding  at  each  hop  is  divided  into  consecutive  monitoring 
intervals,  which  are  asynchronous  among  network  nodes.  A  monitoring  interval  from  A  to  B 
refers  to  the  aggregate  monitoring  for  packets  sent  from  A  to  B  in  that  interval. 

Unlike per-packet monitoring, where MM_S starts a timer and expects an immediate ACK from MM_A
for each packet sent from MM_S to MM_A, in aggregate monitoring MM_S increments a local
monitoring counter C^S_SA for each packet sent to A. Our key observation is that, due to packet
authentication by 1-hop MACs, the packet count becomes a verifiable measure of the packet payload
as well, because a modified packet payload will result in an invalid MAC and cause the packet to
be dropped without polluting the counter. Correspondingly, MM_A increments a local monitoring
counter C^A_SA for each valid packet received from MM_S, and increments another per-neighbor
counter \bar{C}^A_SA for each invalid packet received from S, as Figure 6.2 depicts. These
counters can later be compared to obtain δ_SA = {δ^d_SA, δ^i_SA}, i.e.:

(6.2)  \delta^d_{SA} = |C^S_{SA} - C^A_{SA}|,  \delta^i_{SA} = \bar{C}^A_{SA}

[Figure 6.2 layout: for the monitoring interval from S to A, S holds C^S_SA while A holds C^A_SA
and \bar{C}^A_SA; for the monitoring interval from A to S, S holds C^S_AS and \bar{C}^S_AS while
A holds C^A_AS.]

Figure 6.2: Router state in TrueNet aggregate monitoring: three counters for each neighbor.

Similarly, MM_A sets a counter C^A_AB for the next hop B, and this process recursively builds a
trusted chain of 1-hop aggregate monitoring over the entire end-to-end path, while each node only
has per-neighbor state (monitoring counters).
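
The receiver-side counting and the comparison in Equation 6.2 can be sketched as below. This is
an illustrative sketch only: the MAC is simplified to HMAC-SHA256 over the payload alone (sequence
numbers are left out), and the names ReceiverCounters and link_delta are hypothetical.

```python
import hashlib
import hmac


class ReceiverCounters:
    """Per-neighbor counters kept by the receiving MM for one monitoring interval."""

    def __init__(self, key: bytes):
        self.key = key
        self.valid = 0     # packets with a correct MAC
        self.invalid = 0   # packets with an incorrect MAC

    def on_receive(self, payload: bytes, mac: bytes) -> bool:
        expected = hmac.new(self.key, payload, hashlib.sha256).digest()
        if hmac.compare_digest(mac, expected):
            self.valid += 1
            return True
        # A modified or injected packet never reaches the valid counter.
        self.invalid += 1
        return False


def link_delta(sent: int, recv_valid: int, recv_invalid: int):
    """Compare sender and receiver counters as in Equation 6.2."""
    return abs(sent - recv_valid), recv_invalid
```

In this sketch a modified packet contributes to both components: it is missing from the valid
count and present in the invalid count.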

Periodically, neighbors exchange local monitoring counters in a “request-and-reply” manner to
learn δ_AB for each link l_AB and accuse any link with δ_AB larger than a pre-set accusation
threshold. Specifically, each monitoring interval consists of sending N packets (e.g., 10^4
packets). MM_S counts the number of packets sent in each monitoring interval I from S to A. Once
N packets have been sent, interval I ends, and MM_S generates a counter request R_SA including
the requester S, the next-hop requestee A, the interval number I to prevent replay attacks, and
a MAC computed for the next hop MM_A. Then, similar to per-packet monitoring, MM_S stores I
and C^S_SA, starts a timer to wait for the counter report from MM_A, increments the interval
number I, and zeros C^S_SA for the next interval. Finally, the request R_SA and the report A_SA
are handled in the same way as packets in per-packet monitoring. Based on the received A_SA,
MM_S can calculate δ_SA (Equation 6.2) and accuse the faulty link, if any.
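
The sender-side interval bookkeeping can be sketched as follows. The sketch assumes that MACs
and timers are handled elsewhere and models the counter request simply by its interval number;
the class name IntervalSender and its methods are hypothetical.

```python
class IntervalSender:
    """Sender-side state for aggregate monitoring toward one neighbor."""

    def __init__(self, interval_len: int):
        self.interval_len = interval_len  # N packets per monitoring interval
        self.interval = 0                 # current interval number I
        self.sent = 0                     # sent-packet counter for this interval
        self.outstanding = {}             # interval number -> stored counter

    def on_send(self):
        """Count one sent packet; return an interval number when a request is due."""
        self.sent += 1
        if self.sent < self.interval_len:
            return None
        closed = self.interval
        self.outstanding[closed] = self.sent  # store I and the counter
        self.interval += 1                    # move on to the next interval
        self.sent = 0                         # zero the counter
        return closed

    def on_report(self, interval: int, recv_valid: int, recv_invalid: int):
        """Match a report to a stored interval and compute the two components."""
        sent = self.outstanding.pop(interval, None)
        if sent is None:
            return None  # unknown, expired, or replayed interval number
        return abs(sent - recv_valid), recv_invalid
```

Removing the stored interval once its report is consumed means that a replayed report for the
same interval number is ignored.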
6.7.3  Per-Packet  vs.  Aggregate  Monitoring

Per-packet  monitoring  enables  instant  fault  localization  and  failure  recovery  by  re-transmitting  the 
corrupted  packets  immediately,  at  the  cost  of  an  additional  ACK  per  packet  (or  per  w  packets 
in  a  sliding  window)  on  each  link.  Aggregate  monitoring  reduces  the  communication  overhead  by 
sending  one  counter  report  for  all  the  packets  in  each  monitoring  interval  (with  N  packets),  at  the 
cost  of  additional  fault  localization  delay  (one  monitoring  interval). 

In  TrueNet,  per-packet  monitoring  is  used  to  protect  critical  control-plane  messages,  e.g.,  the 
router  configuration  messages  from  the  network  administrator  to  each  router  as  we  mentioned 
earlier,  global  accusation  message  via  trustworthy  broadcasting  as  we  show  in  Section  6.8,  or  flow 
setup  packets  in  TCP.  Accordingly,  aggregate  monitoring  would  be  used  to  protect  line-rate  data 
packets  for  the  sake  of  lower  overhead,  and  the  network  can  rely  on  transport  layer  protocols  (such 
as  TCP)  for  retransmitting  and  recovering  the  lost  or  corrupted  packets  on  an  end-to-end  basis. 


6.8  TrueNet  Trustworthy  Broadcasting 

TrueNet trustworthy broadcasting achieves reachability, integrity, and trustworthiness of the
broadcast message. Specifically, when a certain node O broadcasts a certain message m, (i) every
node in the network will receive the message as long as the malicious nodes do not cause a graph
partition in the network topology (reachability), (ii) the broadcast message received by each node
is the same as the original one (integrity), and (iii) the broadcast message is trusted, e.g., the
accused link is indeed faulty (trustworthiness).

TrueNet trustworthy broadcasting is built on top of per-packet monitoring to achieve the above
security properties. When a node O originates a broadcast message m, it uses per-packet
monitoring (Table 6.1) to convince O’s neighbors that the message has not been modified from the
original one, thus preserving integrity, and that the message was generated by the correct
monitoring module, thus preserving trustworthiness. Figure 6.3 shows an example of how a broadcast
message propagates using per-hop monitoring (ACKs not shown). The per-packet authenticated ACK in
per-packet monitoring assures a sender that its neighbors have received the correct message, thus
achieving reachability, and that they run the correct monitoring modules and will therefore
faithfully keep broadcasting the message to their neighbors, and so on.

Figure 6.3: TrueNet trustworthy broadcasting example. Node O is the originator of the broadcast
message and other nodes use per-packet monitoring to protect the broadcast message.
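
The reachability property can be illustrated with a small flooding sketch. It abstracts away
MACs and ACKs entirely and only models which nodes a hop-by-hop broadcast reaches when some links
refuse to deliver it; the function name flood and the example topology are assumptions.

```python
from collections import deque


def flood(adjacency, origin, faulty_links=frozenset()):
    """Return the set of nodes reached by flooding a message from origin.

    adjacency    : dict mapping each node to an iterable of its neighbors
    faulty_links : set of frozenset({u, v}) links that fail to deliver the message
    """
    reached = {origin}
    queue = deque([origin])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v in reached or frozenset((u, v)) in faulty_links:
                continue
            reached.add(v)
            queue.append(v)
    return reached
```

In a four-node topology where O connects to A and B, and both connect to C, a single faulty link
between O and A does not stop A from receiving the message via C; only a set of faulty links that
partitions the graph does.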

Duplicate suppression. Numerous methods exist to ensure that the broadcast message traverses
each link only a single time. Due to limited space, we defer detailed protocol design and analysis
to future work. However, a simple method for suppressing duplicate broadcast messages is for each
MM to keep state that lets it detect duplicate messages it may later receive. To reclaim this
state, messages can contain time stamps and nodes can be loosely time synchronized, so that state
only needs to be stored for the maximum clock skew plus the maximum duration for the message to
reach all nodes.
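
One possible realization of this simple method is sketched below. Since the detailed design is
deferred, everything here is an assumption: messages are identified by an opaque identifier,
time is passed in explicitly, and the class name DuplicateFilter is hypothetical.

```python
class DuplicateFilter:
    """Remembers recently seen broadcast messages for a bounded time window."""

    def __init__(self, max_age: float):
        # max_age stands for maximum clock skew plus maximum propagation time.
        self.max_age = max_age
        self.seen = {}  # message id -> time stamp carried in the message

    def accept(self, msg_id, timestamp: float, now: float) -> bool:
        # Reclaim state for messages older than the window.
        self.seen = {m: t for m, t in self.seen.items() if now - t <= self.max_age}
        if now - timestamp > self.max_age:
            return False  # too old: its state may already have been reclaimed
        if msg_id in self.seen:
            return False  # duplicate within the window
        self.seen[msg_id] = timestamp
        return True
```

Rejecting messages older than the window is what makes it safe to discard the corresponding
state: a message that old can no longer be accepted, so it need not be remembered.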

Global accusation. Once a node’s MM detects faults, the MM generates an accusation and
disseminates it inside certain network-wide, periodic beacon messages, such as the periodic
routing updates (or link-state announcements in link-state routing) or keep-alive messages
between neighbors. In TrueNet, each router R’s MM_R expects to receive every neighbor’s beacon
every t seconds; otherwise MM_R accuses the neighbor that did not send a beacon on time (hence a
malicious router OS cannot prevent the locally generated accusations from being sent to its
neighbors). A beacon from a neighbor MM_R' contains any accusation generated by MM_R' and is
protected using per-packet monitoring. If a beacon from MM_R' contains an accusation, this beacon
automatically becomes a broadcast message and is further propagated using trustworthy
broadcasting.
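
The beacon deadline check can be sketched as a small watchdog. This is an assumption-laden
sketch: beacon authentication is not modeled, time is passed in explicitly, and the class name
BeaconWatchdog is hypothetical.

```python
class BeaconWatchdog:
    """Tracks when each neighbor's beacon was last received."""

    def __init__(self, neighbors, period: float, now: float = 0.0):
        self.period = period  # expected beacon period t, in seconds
        self.last_seen = {n: now for n in neighbors}

    def on_beacon(self, neighbor, now: float) -> None:
        # Called only for beacons that passed per-packet monitoring.
        self.last_seen[neighbor] = now

    def overdue(self, now: float):
        """Neighbors whose beacon is late and should therefore be accused."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.period)
```

A router OS that suppresses outgoing beacons thus causes its own links to be accused by the
neighbors rather than silencing the accusation.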
6.9  TrueNet  Fault  Localization  Analysis

This  section  analyzes  TrueNet  fault  localization  delay,  security  and  overhead,  while  Section  6.11 
presents  real-field  implementation  and  evaluation. 

6.9.1  Fault  Localization  Delay 

The fault localization delay in per-packet monitoring is governed by the number of packet
re-transmissions r: only when all r + 1 packets fail to be delivered (i.e., failure recovery
fails) will a link accusation be made. Theorem 19 states the lower bound of r. During aggregate
monitoring, at the end of a monitoring interval, a router A can learn the accurate δ_AB for each
local link l_AB (thus achieving aggregate fault localization), regardless of the interval length
N (number of packets sent in that interval). Hence, the value of N is set based on the desired
tradeoff between detection delay and communication overhead. For example, a smaller N enables
faster detection but increases the number of counter reports (one report required for every N
packets). Furthermore, since faulty links are defined and detected based on the accusation
threshold, the value of N is also constrained by the required accuracy of the threshold-based
faulty link accusation. Specifically, too small an N will introduce considerable noise in the
observed link loss rate, given by δ^d / N, due to the existence of spontaneous packet loss.
Theorem 19 states the lower bound of N for achieving a sufficiently high accusation accuracy.

Theorem 19. Suppose the natural packet drop rate on a link is p, the accusation threshold is
T_dr = p + ε where T_dr ∈ (0, 1) (see footnote 1), and the allowed false positive and false
negative rate is σ. Then the fault localization delay, i.e., the number of packet
re-transmissions for failure recovery, in per-packet monitoring is at least

    r = \frac{\ln \sigma}{\ln p} - 1.

The fault localization delay, i.e., the monitoring interval length, in aggregate monitoring is at
least

    N = \frac{\ln(2/\sigma)}{2 (T_{dr} - p)^2}.

Footnote 1: To simplify the mathematical formula, we denote T_dr as a fraction of packets
dropped, instead of the absolute number of dropped packets as the original T_d denotes.

Proof. We assume each link has a natural drop rate p.

Per-packet monitoring. The probability that a benign link “naturally” drops all r + 1 packets
(including r re-transmissions), i.e., the false positive rate fp, is given by fp = p^{r+1}. Since
we require fp ≤ σ and ln p < 0, we have r ≥ ln σ / ln p − 1.

Aggregate monitoring. We study how many packet transmissions are required to estimate the
drop rate of a single link l_ij within a certain accuracy interval. Suppose that the true value
of the drop rate of l_ij is θ*_ij, and the estimated drop rate of l_ij is θ_ij. We compute the
number of packets needed to achieve (ε, σ)-accuracy for θ_ij:

(6.3)  \Pr(|\theta_{ij} - \theta^*_{ij}| > \epsilon) \le \sigma

i.e., with probability 1 − σ the estimated θ_ij is within (θ*_ij − ε, θ*_ij + ε). We define each
time a data packet is sent over link l_ij as a random trial, and thus each monitoring interval
has N random trials. Then using Hoeffding’s inequality, we have:

(6.4)  \Pr(|\theta_{ij} - \theta^*_{ij}| > \epsilon) \le 2 e^{-2N\epsilon^2}

Then, to satisfy Equation 6.3, it suffices that:

(6.5)  2 e^{-2N\epsilon^2} \le \sigma \;\Rightarrow\; N \ge \frac{\ln(2/\sigma)}{2\epsilon^2}

Since ε = T_dr − p, we further have:

    N \ge \frac{\ln(2/\sigma)}{2 (T_{dr} - p)^2}

□
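
To give a feel for the magnitude of the aggregate-monitoring bound, the sketch below evaluates it
numerically. The function name min_interval_length and the parameter values are assumptions
chosen purely for illustration.

```python
import math


def min_interval_length(p: float, t_dr: float, sigma: float) -> int:
    """Smallest integer N with 2 * exp(-2 * N * eps**2) <= sigma, eps = t_dr - p.

    p     : natural drop rate of the link
    t_dr  : accusation threshold, as a fraction of packets dropped
    sigma : allowed false positive / false negative rate
    """
    eps = t_dr - p
    if eps <= 0.0 or not (0.0 < sigma < 1.0):
        raise ValueError("need t_dr > p and 0 < sigma < 1")
    return math.ceil(math.log(2.0 / sigma) / (2.0 * eps ** 2))
```

For example, with p = 0.01, T_dr = 0.03, and σ = 0.01, the bound evaluates to N = 6623 packets,
which is of the same order as the 10^4-packet intervals used as an example in this chapter.
Because N grows with 1/ε², halving the gap between the threshold and the natural drop rate
quadruples the required interval length.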


Finally, the network-wide faulty link detection process is accelerated in TrueNet since a faulty
link detected by one node will be removed from the routing tables of all other nodes, whereas in
existing protocols a node cannot rely on others’ accusations because of slander attacks.


6.9.2  Security  analysis 

TrueNet achieves per-packet and aggregate fault localization via per-packet and aggregate
monitoring, respectively. Recall that the adversary can drop, modify, inject, replay, re-order,
and misroute packets at links under its control.


Per-packet fault localization. Packet dropping, modification, and injection attacks between
MM_A and MM_B will cause MM_A or MM_B to fail to generate authentic ACKs for the original
packets; thus the link l_AB that corrupts the packets will be localized. Packet replay and
re-ordering attacks from MM_A to MM_B will cause packets to be dropped at MM_B thanks to the use
of per-neighbor sequence numbers, because MM_B stores and only expects a packet with the most
recent per-neighbor sequence number. Finally, packet misrouting attacks cannot succeed because
the source embeds the expected path p in the packets, and routers perform next-hop checking based
on the path and drop any packets that are misrouted.

Aggregate fault localization. Without loss of generality, we consider a monitoring interval
from A to B as an example. Upon receiving the counters C^B_AB and \bar{C}^B_AB from B (otherwise
MM_A can immediately accuse l_AB for not sending a correct counter report), MM_A can first be
convinced that the counter values were reported by the correctly running MM_B and are thus
correct. Then MM_A can compute δ_AB and detect any fault. Similar to the analysis of per-packet
fault localization above, packet dropping will increase δ^d_AB, and packet modification,
injection, replay, re-ordering, and misrouting will increase δ^i_AB.

We make one interesting observation about the packet misrouting attack, using Figure 6.1 as an
example topology. The malicious node B can first misroute the packets to a colluding neighbor C'
(not shown in the figure), which then transparently forwards the packet back to C (the legitimate
next hop of B in path p) without passing the packet through MM_C'. TrueNet treats this as a
legitimate case which does not violate aggregate fault localization, because in the logical
protected path the packets still traverse from MM_B to MM_C in order. This packet detouring is
only possible between colluding neighbors, which can be treated as one logical malicious entity,
and is akin to detouring packets inside the same malicious router.

6.9.3  Overhead  Analysis 

Storage overhead. We focus on the router state required for per-packet processing, which needs
to reside in on-chip memory or cache and usually becomes the system scalability bottleneck. The
router state in TrueNet includes (i) per-neighbor secret keys (e.g., 16 bytes per neighbor) for
both per-packet and aggregate monitoring, and (ii) three monitoring counters (e.g., 3 × 8 bytes)
in aggregate monitoring, as Figure 6.2 shows. Since per-packet monitoring is used for infrequent
(compared to the link rate) packets, its state can be stored either in the ample off-chip DRAM
or in a small cache (storing up to w packets in a sliding window at any time).
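
As a back-of-the-envelope illustration of the sizes quoted above, the following sketch totals the per-neighbor keys and aggregate-monitoring counters for a router. Only the 16-byte key and the three 8-byte counters per neighbor come from the text; the function name and the example neighbor count are hypothetical.

```python
KEY_BYTES = 16             # per-neighbor secret key size quoted in the text
COUNTER_BYTES = 8          # size of one monitoring counter quoted in the text
COUNTERS_PER_NEIGHBOR = 3  # three aggregate-monitoring counters per neighbor


def fast_path_state_bytes(num_neighbors: int) -> int:
    """Bytes of key and counter state for a router with the given number of neighbors."""
    per_neighbor = KEY_BYTES + COUNTERS_PER_NEIGHBOR * COUNTER_BYTES
    return num_neighbors * per_neighbor


# Hypothetical router with 32 neighbors.
example_bytes = fast_path_state_bytes(32)
```

Under these assumptions the state grows only with the node degree, not with the number of paths or flows traversing the router.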

Communication overhead. The extra communication overhead in TrueNet per-packet monitoring
is one ACK per packet or per sliding window of w packets. The communication
overhead in aggregate monitoring is one counter report per monitoring interval (e.g., every 10^4
packets). When per-packet monitoring is only used for protecting infrequent (compared to the
line rate) control messages such as flow setup in TCP and link-state routing updates, the extra
communication overhead amortized over each data packet is small.

6.10  TrueNet  Router  Architecture 

We  present  a  TrueNet  router  architecture  leveraging  a  dedicated  hypervisor  and  TPM  chip  to 
implement  the  trusted  computing  primitives  (remote  attestation,  isolation,  and  sealed  storage), 
and  modern  mainstream  router  hardware  to  speed  up  time-critical  operations  in  TrueNet. 

Anatomy of a TrueNet router. Modern routers commonly use a switch-based router architecture
with fully distributed processors [20], and the network interfaces perform almost all the
critical data-path operations for a normal packet. Figure 6.4 shows the architecture of a TrueNet
router, where the shaded components are those added in a TrueNet router but not present in a
standard modern router; they also constitute the TCB for TrueNet. As Figure 6.4 shows, each
TrueNet router is equipped with a TPM chip and CPUs with hardware virtualization support (e.g.,
AMD SVM [10] or Intel TXT [44]), and installs a dedicated hypervisor such as TrustVisor [69].
The dedicated hypervisor isolates the MM from the rest of the router system (e.g., router OS,
peripheral devices, etc.), enables remote attestation and sealed storage with the support of the
TPM chip, and protects the MM’s execution integrity, data integrity, and secrecy. Similar to
TrustVisor [69], the TPM operations are only needed when the dedicated hypervisor boots, to
ensure the hypervisor’s integrity; afterwards, the dedicated hypervisor itself performs attestation
and storage sealing to improve efficiency.

Figure 6.4: TrueNet router architecture.


For better performance, we anticipate that every network interface hosts a trusted hardware
MAC Module (MACM) to perform the MAC operations of the MM described earlier. A MACM
has a piece of private memory and a high-speed MAC computation module. The private
memory of the MACM is mapped to main memory residing in the CPU subsystem and shared
with the local MM. The dedicated hypervisor also protects this region of main memory from the
rest of the CPU subsystem, so that only the MM can read from and write to it. Alternatively,
the MACM can be implemented inside the software MM as described earlier,
which is what we did for our prototype (Section 6.11.1).

Software monitoring module MM. An MM handles all control-plane operations in a TrueNet system
that are infrequent or not time-critical. First, the local MM negotiates secret keys with
the MMs on the neighboring routers and writes the secret keys into the main memory region that
maps the private memory of the MACM. Secret key negotiation only happens periodically, according to
the cryptographic key lifetime. Second, MMs on the source nodes also handle packets originating
from the connected end-hosts by adding the entire routing path into the packets (Section 6.7) for
1-hop monitoring. Third, the MM is responsible for generating accusations to be broadcast
in the beacon messages. In addition, the MM periodically checks the locally stored timers for
awaited ACKs from neighbors to detect and remove any expired entries.

Dedicated MAC module. The dedicated MAC module (MACM) is responsible for all data-plane
operations to achieve high packet processing throughput. On a per-packet basis, a MACM verifies
the MAC in the packet, validates the correct presence of the local router in the embedded path,
computes the new MAC using the secret key shared with the next-hop router, updates per-neighbor
sequence numbers and monitoring counters, and attaches the new MAC to the packet. To achieve
high throughput in MAC computation, we can use parallelizable MAC algorithms such as XOR-MAC
[22], XECB-MAC [34], and PMAC [23], or high-speed hardware implementations [93, 81, 62, 86],
which can obtain more than 62.6 Gbps throughput.
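
The per-packet steps listed above can be sketched in software as follows. This is a minimal illustration, not the MACM design itself: it assumes HMAC-SHA256 as a stand-in MAC, a packet layout invented for the example, and hypothetical counter names (`good`, `bad`). Sequence-number replay checks are omitted for brevity.

```python
import hashlib
import hmac
from dataclasses import dataclass, field


def mac(key: bytes, seq: int, path: tuple, payload: bytes) -> bytes:
    """Stand-in MAC (HMAC-SHA256) over sequence number, embedded path, and payload."""
    msg = seq.to_bytes(8, "big") + "|".join(path).encode() + b"|" + payload
    return hmac.new(key, msg, hashlib.sha256).digest()


@dataclass
class Packet:          # hypothetical packet layout
    path: tuple        # embedded routing path, e.g. ("A", "B", "C")
    payload: bytes
    seq: int           # per-neighbor sequence number set by the previous hop
    tag: bytes         # MAC computed by the previous hop


@dataclass
class Macm:
    name: str
    keys: dict                                     # neighbor name -> shared secret key
    next_seq: dict = field(default_factory=dict)   # per-neighbor outgoing sequence numbers
    good: dict = field(default_factory=dict)       # per-neighbor count of accepted packets
    bad: dict = field(default_factory=dict)        # per-neighbor count of rejected packets

    def process(self, pkt: Packet, prev_hop: str):
        """Return (re-tagged packet, next hop), or None if the packet is rejected."""
        expected = mac(self.keys[prev_hop], pkt.seq, pkt.path, pkt.payload)
        ok_mac = hmac.compare_digest(expected, pkt.tag)
        i = pkt.path.index(self.name) if self.name in pkt.path else -1
        # The local router must sit right after prev_hop and must not be the last node.
        ok_path = 0 < i < len(pkt.path) - 1 and pkt.path[i - 1] == prev_hop
        if not (ok_mac and ok_path):
            self.bad[prev_hop] = self.bad.get(prev_hop, 0) + 1
            return None
        self.good[prev_hop] = self.good.get(prev_hop, 0) + 1
        nxt = pkt.path[i + 1]
        seq = self.next_seq.get(nxt, 0)
        self.next_seq[nxt] = seq + 1
        new_tag = mac(self.keys[nxt], seq, pkt.path, pkt.payload)
        return Packet(pkt.path, pkt.payload, seq, new_tag), nxt
```

Each hop only needs the keys it shares with its direct neighbors, which is what keeps the key state proportional to the node degree.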

6.11  Implementation  and  Evaluation 

In  this  section,  we  evaluate  both  TrueNet’s  computational  overhead  based  on  our  Linux  prototype 
of  a  TrueNet  router  and  TrueNet’s  storage  overhead  based  on  real-world  ISP  topologies  and  traffic 
traces.  We  show  that  even  when  implementing  MACM  inside  the  software  MM,  a  TrueNet  router 
can  achieve  gigabit  line  rate  with  only  commodity  multi-core  support,  and  the  state  in  a  TrueNet 
router  is  up  to  five  orders  of  magnitude  less  than  in  related  work  [97,  21]. 

6.11.1  Prototype  and  Computational  Overhead 

We implement a TrueNet router prototype in Linux with a TPM chip to evaluate the per-packet
cryptographic computational overhead of a TrueNet router. We show the performance of a TrueNet
intermediate router, which performs two MAC operations per packet (verification of the
previous-hop MAC and generation of the next-hop MAC) inside the software MM. We observe that the
TrueNet per-packet cryptographic operations, even when implemented in the TrueNet software module
MM without any hardware acceleration, can fully cope with gigabit link-rate processing of data
packets and scale to higher performance with more CPUs. We anticipate that the dedicated
hardware MACM (Section 6.10) can further boost the TrueNet router throughput.

Platform. We performed all experiments on off-the-shelf servers with one Intel Xeon E5640 CPU
(four 2.66 GHz cores, 256 KB L1 cache, 1 MB L2 cache, 12 MB L3 cache) and 12 GB of DDR3 RAM with
25.6 GB/s memory bandwidth. This CPU supports the new Intel AES-NI instructions [45] for
high-speed AES computation. The servers are equipped with TPM chips and Broadcom NetXtreme II
BCM5709 Gigabit Ethernet interface cards, and run the Ubuntu 10.04 32-bit Desktop OS.

Prototype. In our TrueNet prototype, we modify TrustVisor [69] to serve as our dedicated hypervisor.
We run Ubuntu Linux on top of our hypervisor and implement a TrueNet intermediate router
as a multi-threaded user-space process. A TrueNet router process includes the secure software
module MM and an untrusted network stack. The untrusted network stack consists of two threads: a
receiver thread that listens for network packets via TUN/TAP virtual interfaces and puts received
packets into an input packet queue, and a forwarder thread in charge of sending the packets in the
output packet queue to their appropriate next-hop routers. Multiple MMs run as child threads that
constantly poll the input packet queue, copy new incoming packets to a shared output packet
queue, and perform MAC computations. We use the CMAC-AES-128 MAC algorithm to leverage
the new AES-NI instructions on Intel CPUs.

Our software module MM performs, in software, per-packet cryptographic operations similar to those
of the hardware module MACM proposed in Section 6.10, while maintaining the same security
guarantees. The MM child threads run inside the secure and isolated execution environment
provided by the dedicated hypervisor from the moment the threads start. The dedicated hypervisor
also configures the memory region of the input packet queue to be accessible by both the untrusted
network stack and the MMs, and the output packet queue to be writable by the MMs but only readable
by the untrusted network stack. This memory configuration assures the MM’s execution integrity.
Finally, the TPM securely boots and late-launches the dedicated hypervisor to guarantee its
integrity, as described in the TrustVisor proposal [69].

Throughput and Latency Breakdown. We tested the throughput of our software TrueNet
router prototype using the widely adopted network performance benchmarking tool Netperf [4].
Figure 6.5 shows the test result. The baseline performance in the figure is obtained by using
a main thread to receive packets, two MM threads to move packets to the output packet queue
without any other operations, and one forwarder thread to send packets out to next-hop routers. For
the TrueNet prototype, the test setting is the same as in the baseline test, except
that the MM threads perform TrueNet packet validation and MAC computations for every packet. As
Figure 6.5 shows, the TrueNet prototype incurs negligible throughput degradation compared with
the baseline (the maximum degradation in our test is (817-789)/817=4.5% when the packet
size is 1024 bytes; most degradation rates are under 2%).

Figure 6.5: TrueNet router throughput. (x-axis: packet size in bytes)

...       ...   ...   ...   0.4
Others    1.3   1.1   0.7   1.1
Total     5.5   4     2.3   1.5

Table 6.2: TrueNet software module MM’s latency overhead breakdown. All values are the average
time (microseconds) over 50000 packet processing trials.

We also show a latency overhead breakdown of the software module MM’s per-packet
processing. Table 6.2 shows that, leveraging the new AES-NI instructions, MAC computations
are highly efficient (on average 3 CPU cycles per packet byte). In our prototype, AES key setup
time is negligible, since each TrueNet router only needs to hold one session key per neighboring
router during a session key lifetime, and we can pre-compute all AES sub-keys.

Figure 6.6: Key storage overhead of a single router on ISP topologies.

6.11.2  Storage  Overhead  Measurement 

TrueNet’s ability to deliver strong security properties (instant failure recovery with per-packet fault
localization, global accusation, etc.) with less state than previous attempts [21, 97] follows from its
design. Still, measurements under real-world conditions provide a quantitative assessment of
TrueNet’s advantage.

Rocketfuel-based measurements. The Rocketfuel topologies [84] of various top-tier ISPs extend
from the ISPs’ peering routers to approximately the first hop within a customer’s network.
To assess TrueNet’s overhead, we count the node degree of each router in the topology and compare
it to the number of nodes in the network, which represents the key storage overhead of the recently
proposed Statistical fault localization [21] and PAAI [97]. Figure 6.6 suggests that TrueNet incurs
on average two orders of magnitude less overhead in the worst case (considering the maximum
node degree in the topologies), and three orders of magnitude less overhead in the average case
(considering the average node degree).
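
The comparison described above reduces to counting node degrees in a topology. The sketch below illustrates this on a toy edge list; the function name, the dictionary keys, and the five-node topology are hypothetical, and the number of nodes is used as the proxy for path-based key storage exactly as the text describes.

```python
from collections import defaultdict


def key_storage_comparison(edges):
    """Compare one-key-per-neighbor storage against one-key-per-node storage."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degrees = [len(adjacent) for adjacent in neighbors.values()]
    return {
        "nodes_in_network": len(neighbors),          # proxy for path-based protocols
        "max_degree": max(degrees),                  # TrueNet worst case
        "avg_degree": sum(degrees) / len(degrees),   # TrueNet average case
    }


# Toy topology: hub H with four neighbors, plus one extra link between a and b.
toy_edges = [("H", "a"), ("H", "b"), ("H", "c"), ("H", "d"), ("a", "b")]
toy_result = key_storage_comparison(toy_edges)
```

On a real ISP topology the gap between the node count and the degree is what produces the orders-of-magnitude difference reported in Figure 6.6.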

Internet2-based measurements. Internet2 provides similar topology data for its core
routers, which Figure 6.6 also illustrates (labeled as “I2”). Since this topology only includes core
routers, TrueNet does not deliver the orders-of-magnitude reduction achieved with the Rocketfuel
topologies; it provides an 83% savings in the average case and 69% in the worst case. Conveniently,
Internet2 also provides Netflow data, allowing for measurement of TrueNet’s and Statistical
fault localization’s monitoring state overhead. These Netflow files capture 1 in every 100 packets
seen over a five-minute interval. In Statistical fault localization, the router stores a roughly
500-byte “secure sketch” [36] for each path (identified by each unique source and destination pair
in our measurement). In contrast, a TrueNet router maintains three counters (24 bytes) for each
neighbor. Figure 6.7 shows that TrueNet requires approximately five orders of magnitude less
monitoring state. Additionally, these flow data allow for a more accurate estimation of the key
storage overhead in Statistical fault localization (the number of sources with traffic concurrently
traversing the same router), also shown in Figure 6.7 (the key storage overhead in TrueNet is still
one key per neighbor).

Figure 6.7: Overhead comparison based on Internet2 topology and traffic traces.


6.12  Discussion 

6.12.1  Incremental  Deployment 

Although we argue that it is feasible to upgrade all routers with trusted computing primitives
within a single administrative domain, we note that partial deployment of TrueNet can still benefit
the early adopters. Specifically, when only a subset of routers in a network is equipped with
TrueNet, the monitoring modules still constitute logical protected paths, where a logical protected
link between two MMs may consist of multiple physical links. Figure 6.8 shows an example where the
shaded nodes have deployed TrueNet and a logical protected link consists of l_AB and l_BC. Hence,
fault localization is still achieved on each logical protected link (though not on an exact physical
link), which helps localize the failure to a bounded region and facilitates network diagnosis.
Furthermore, the more densely the MMs are deployed, the more accurate the failure localization can
be, which provides an incentive for incrementally deploying TrueNet.

Figure 6.8: Incremental deployment of TrueNet. The shaded nodes have deployed TrueNet and
form logical trust links between each other.
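
To make the notion of a logical protected link concrete, the sketch below groups the physical links of a path into logical links between consecutive TrueNet-equipped nodes. The function name and the four-node path are hypothetical; segments before the first or after the last equipped node are simply left unprotected in this simplification.

```python
def logical_protected_links(path, deployed):
    """Group the physical links of `path` into logical links between deployed nodes."""
    links = []
    start = None  # index of the most recent deployed node seen so far
    for i, node in enumerate(path):
        if node not in deployed:
            continue
        if start is not None:
            # Physical links between the previous deployed node and this one.
            links.append([(path[j], path[j + 1]) for j in range(start, i)])
        start = i
    return links


# A and C (and D) have deployed TrueNet; B has not, so l_AB and l_BC form one logical link.
example_links = logical_protected_links(["A", "B", "C", "D"], {"A", "C", "D"})
```

A fault anywhere inside a multi-link group is attributed to the whole group, which is why denser deployment yields finer localization.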

6.12.2  Interdomain  Deployment 

TrueNet mainly targets intra-domain networks such as ISP and enterprise networks, where
sophisticated hardware attacks can be precluded since the remote attacker (the adversary model
we consider) does not have physical access to the routers. However, it is ineffective to deploy
TrueNet in the current inter-domain setting, where each Autonomous System (AS) represents a
node in TrueNet, because a selfish or malicious AS has physical access to its routers and can thus
subvert the hardware (e.g., TPM chips) upon which trusted computing primitives rely. Fortunately,
the recently proposed SCION [96] inter-domain architecture groups the ASes into different trust
domains, within which strong contractual or legislative regulation can be enforced. Hence, an AS
tampering with the hardware can be legally penalized by the containing trust domain. This
architecture naturally enables the wide deployment of TrueNet (or trusted computing primitives in
general) across different ASes within a trust domain. Meanwhile, TrueNet also serves as an example
of how to technically achieve enforceable accountability within a trust domain in SCION.






6.13  Summary 

In this chapter, we demonstrate that trusted computing enables transitivity of verification and
eliminates the need to establish direct point-to-point trust between any two nodes in the network,
which incurs high storage overhead and obstructs key management. TrueNet employs only a small
TCB to achieve secure fault localization with small router state, dynamic path support, and global
accusation, properties that are proven impossible in traditional networks. Though it achieves much
smaller protocol overhead than path-based fault localization approaches (PAAI and ShortMAC),
TrueNet requires special hardware support (such as TPM chips and hardware virtualization) and
is vulnerable to hardware attacks. In the next chapter, we present a 1-hop-based fault localization
protocol with small overhead that does not rely on trusted computing.


Chapter  7 


DynaFL 


Like PAAI and ShortMAC, most existing secure fault localization protocols are path-based: they
assume that the source node knows the entire outgoing path that delivers its packets
and that the path is static and long-lived. However, these assumptions are incompatible with the
dynamic traffic patterns and agile load balancing commonly seen in modern networks. To cope
with real-world routing dynamics, we propose the first secure neighborhood-based fault localization
protocol, DynaFL, with no requirements on path durability or on the source node knowing the outgoing
paths. DynaFL aims to localize data-plane faults to a 1-hop neighborhood, instead of a specific
link. Unlike TrueNet, DynaFL requires no special hardware support. Through a core technique
that we call delayed function disclosure, DynaFL incurs little communication overhead and a small,
constant router state independent of the network size or the number of flows traversing a router.
In addition, each DynaFL router maintains only a single secret key, 100 to 10000 times
fewer keys than in path-based fault localization protocols according to our measurement results.

7.1  Introduction 

Existing fault localization protocols that are secure against sophisticated packet dropping and
modification attacks [17, 97, 21] require that the sender know the entire path that delivers its
packets, and that the path be long-lived (e.g., stable over the transmission of 10^8 packets [21]) to
obtain a statistically accurate fault localization. However, recent measurement studies [16, 38, 31]
show that a considerable fraction of current network flows are short-lived “mice” and that routing
paths are highly dynamic. Furthermore, emerging enterprise and datacenter networks call for more
agile load balancing and dynamic routing paths. For example, a recently proposed datacenter routing
architecture, VL2 [38], employs Valiant Load Balancing to spread traffic uniformly across network
paths via random packet deflection. In this case, the actual routing path is determined on the
fly during forwarding and thus cannot be predicted or known by the sender. Given the conflict
between the “static-path” assumption and the “dynamic-path” reality, researchers have concluded
that existing fault localization protocols are impractical for widespread deployment in large-scale
networks with dynamic traffic patterns [21].

In addition, in existing secure fault localization protocols, a router must share some secret (e.g.,
cryptographic keys) with each source node sending traffic traversing that router, making the key
storage overhead at an intermediate router linear in the number of end nodes. The proliferation of
key copies shared by routers with all end nodes under non-uniform (and generally poor)
administration also increases the risk of key compromise, thereby enabling undetected attacks. In
existing secure fault localization protocols, a router also needs to maintain per-path state for each
path traversing that router, making fault localization unscalable for large-scale networks.

We  aim  to  bridge  the  current  gap  between  the  security  of  fault  localization  against  strong 
adversaries  and  the  ability  to  support  dynamic  traffic  patterns  in  modern  networks  such  as  ISP, 
enterprise,  and  datacenter  networks.  More  specifically,  the  desired  fault  localization  protocol  should 
be  secure  against  sophisticated  packet  dropping,  modification,  fabrication,  and  delaying  attacks  by 
colluding  routers,  while  retaining  the  following  properties: 

•  Path obliviousness: A source node or a router does not need to know the outgoing/downstream
path.

•  Volatile path support: The fault localization protocol requires no minimum lifetime for a
forwarding path.

•  Constant router state: A router does not need to maintain per-path, per-flow, or per-source
state.

•  O(1) key storage: A router only manages a small number of keys regardless of the network
size.

Figure 7.1: Path-based fault localization. TS_r denotes the traffic summary generated by router r.
For brevity, “TS_a ≠ TS_b” refers to “TS_a deviates from TS_b more than a certain threshold”.

Path  obliviousness  and  volatile  path  support  together  enable  agile  (e.g.,  packet-level)  load  balancing 
and  dynamic  routing  paths  (e.g.,  Valiant  load-balanced  paths).  These  two  properties  also  decouple 
the  data-plane  fault  localization  from  routing,  thus  enabling  it  to  support  a  wide  array  of  routing 
protocols. Finally, constant router state provides scalability in large-scale networks and O(1) key
storage reduces the security risk due to key compromise.

We observe that the “static-path” assumption in existing secure fault localization protocols
stems from the fact that those protocols operate on entire end-to-end paths
(path-based) to localize the fault to one specific link. As Figure 7.1 shows, each router maintains
a certain “traffic summary” (e.g., a counter, packet hashes, etc.) for each path that traverses the
router (thus requiring per-path state), and sends the traffic summary to the source node S of each
path. S can then detect a link l as malicious if the traffic summaries from l’s two adjacent nodes
deviate greatly, as Figure 7.1 illustrates. Hence, S needs to know the entire path topology to
compare traffic summaries of adjacent nodes, and needs to send a large number of packets over the
same path so that the deviation in traffic summaries can reflect a statistically accurate estimation
of link quality. Finally, to authenticate the communication between the source and each router
in the path, a router needs to share a secret key with each source that sends traffic through that
router.
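
The path-based check performed by the source can be sketched as follows. The sketch assumes the simplest traffic summary mentioned above, a per-router packet counter, and omits the authentication of reports; the function name, the threshold value, and the example counts are hypothetical.

```python
def localize_on_path(path, summaries, threshold):
    """Return the links whose two adjacent nodes report summaries deviating beyond the threshold.

    path: ordered list of node names, starting at the source S.
    summaries: node name -> packet count reported for this path.
    """
    suspicious = []
    for upstream, downstream in zip(path, path[1:]):
        if abs(summaries[upstream] - summaries[downstream]) > threshold:
            suspicious.append((upstream, downstream))
    return suspicious


# Hypothetical reports: roughly 300 packets vanish between a and b.
reports = {"S": 1000, "a": 998, "b": 700, "D": 699}
flagged = localize_on_path(["S", "a", "b", "D"], reports, threshold=20)
```

The loop needs the ordered path as input, which is exactly the dependency on path knowledge that the text identifies.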

In this chapter, we explore neighborhood-based fault localization approaches, where a router r’s
data-plane faults (if any) can be detected by checking the consistency (or conservation) of the
traffic summaries generated by the 1-hop neighbors of r (denoted by N(r) in Figure 7.2). That
is, in benign cases, the packets sent to r will be consistent with the packets received from r by
all of r’s neighbors, as reflected in their traffic summaries. In this way, fault localization is
independent of routing paths and only depends on 1-hop neighborhoods, thus supporting arbitrary
routing protocols and dynamic load balancing. Additionally, each router in a neighborhood-based
approach only needs to maintain state for each neighbor. In summary, neighborhood-based fault
localization localizes faults to a specific 1-hop neighborhood to narrow further investigation, trading
localization precision for practicality in modern networks with dynamic traffic patterns.
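
The conservation idea can be illustrated with a toy check over a single neighborhood. For clarity the sketch compares raw packet identifiers rather than compact traffic summaries, and it assumes r is a pure transit node (no traffic originates or terminates at r); the function and variable names are hypothetical.

```python
from collections import Counter


def neighborhood_discrepancy(sent_to_r, received_from_r):
    """Return (dropped, injected) packet counts for router r.

    Each argument maps a neighbor name to the packets that neighbor sent to r
    or received from r during one interval.
    """
    entered = Counter()
    for packets in sent_to_r.values():
        entered.update(packets)
    left = Counter()
    for packets in received_from_r.values():
        left.update(packets)
    dropped = sum((entered - left).values())   # went into r, never came out
    injected = sum((left - entered).values())  # came out of r, never went in
    return dropped, injected


# r silently drops p2 and emits a packet x9 that no neighbor sent to it.
result = neighborhood_discrepancy(
    {"a": ["p1", "p2"], "b": ["p3"]},
    {"c": ["p1"], "d": ["p3", "x9"]},
)
```

A modified packet shows up as one drop plus one injection, and the check never consults the routing path.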

Though promising, neighborhood-based fault localization is susceptible to sophisticated packet
modification and collusion attacks and faces several security and scalability challenges. For
example, for the sake of scalability, the traffic summary cannot be a copy of all the original
packets (or even their hashes), but has to be a compact representation of the original packets via a
certain fingerprinting function F. On one hand, if F generates traffic summaries at different nodes
without using different secret keys, a malicious router can predict the outputs of F at other nodes
and tactically modify packets such that the outputs of F stay the same as with the original
packets. On the other hand, if F at different nodes uses different secret keys, we cannot compare
or run consistency checks over different nodes’ traffic summaries. To address these challenges,
we propose DynaFL, a protocol that employs a core technique called delayed function disclosure,
which discloses the same key for computing F to different routers after they have forwarded the
packets. To further minimize the protocol overhead, DynaFL employs a secure sampling mechanism
also based on delayed function disclosure, so that a malicious router cannot know whether a packet
is sampled at the time it forwards (or corrupts) the packet. Finally, a router in DynaFL only
shares a secret key with a centralized controller, thus achieving O(1) key storage.
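
The following sketch illustrates the idea of delayed function disclosure under simplifying assumptions that are not taken from the protocol description: HMAC-SHA256 stands in for the keyed fingerprinting function F, routers buffer whole packets until the key arrives, sampling is a modulo test on the fingerprint, and the summary is a sorted list of fingerprints rather than a compact sketch. All class and function names are hypothetical.

```python
import hashlib
import hmac


def fingerprint(epoch_key: bytes, packet: bytes) -> int:
    """Keyed fingerprint of a packet (HMAC-SHA256 stand-in for F)."""
    digest = hmac.new(epoch_key, packet, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def is_sampled(epoch_key: bytes, packet: bytes, one_in: int) -> bool:
    """Keyed sampling decision: unknowable until the epoch key is disclosed."""
    return fingerprint(epoch_key, packet) % one_in == 0


class Router:
    def __init__(self):
        self.buffer = []  # packets forwarded in the current epoch

    def forward(self, packet: bytes) -> None:
        # At forwarding time the router does not yet hold the epoch key.
        self.buffer.append(packet)

    def summarize(self, epoch_key: bytes, one_in: int):
        """Run only after the key for the finished epoch has been disclosed."""
        sampled = [p for p in self.buffer if is_sampled(epoch_key, p, one_in)]
        self.buffer = []
        return sorted(fingerprint(epoch_key, p) for p in sampled)
```

Because every router receives the same key, honest summaries are directly comparable; because the key arrives only after forwarding, a router cannot tailor a modification to collide under F or to dodge the sample.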
Contributions. Our contributions are three-fold:

1.  We highlight the importance of a secure fault localization design that copes with dynamic
traffic patterns in real-world operational networks with a small, constant router state and key
storage.

2.  To  the  best  of  our  knowledge,  DynaFL  is  the  first  secure  neighborhood-based  fault  localization 
protocol  that  achieves  path  obliviousness  and  volatile  path  support,  and  is  secure  against  both 
packet  loss  and  sophisticated  packet  modification/injection  attacks. 

3.  A DynaFL router requires only about 4 MB of per-neighbor state based on our AMS
sketch [11] implementation, whereas path-based fault localization protocols require per-path state.
We also show through measurements that the number of keys a router needs to manage in
path-based fault localization protocols is 100 to 10000 times higher than in DynaFL (which uses a
single key shared with a centralized controller). Finally, our simulation results demonstrate
DynaFL’s small detection delay and negligible communication overhead.

7.2  Setting 

Besides  the  problem  formulation  described  in  Chapter  2,  we  introduce  additional  notation  and 
definitions  for  this  chapter  below. 

Notation. We denote the 1-hop neighborhood (or neighborhood, for brevity) of a node s as N(s),
as Figure 7.2 illustrates. For a particular packet traversing a neighborhood N(s), the neighbor
sending that packet to node s is called an ingress node in N(s) for that packet, and the node
receiving that packet from s is called an egress node. We term a sequence of packets a packet
stream S. In particular, we denote the packet stream sent from node i to node j as S_ij, and this
packet stream is seen by nodes i and j as S_ij^i and S_ij^j, respectively. The difference of two packet
streams S and S′, denoted by Δ(S, S′), refers to the number of packets in one packet stream but
not in the other, without considering the variant IP header fields such as the TTL and checksum.
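
One way to read the definition of the stream difference is as the size of a multiset symmetric difference over the invariant part of each packet. The sketch below takes that reading as an assumption; the packet fields and function names are hypothetical and chosen only to show that TTL and checksum changes are ignored.

```python
from collections import Counter
from typing import NamedTuple


class Packet(NamedTuple):  # hypothetical, simplified packet
    src: str
    dst: str
    ttl: int
    checksum: int
    payload: bytes


def invariant_part(p: Packet):
    """Drop the fields that legitimately change hop by hop (TTL, checksum)."""
    return (p.src, p.dst, p.payload)


def stream_difference(s1, s2) -> int:
    """Number of packets that appear in exactly one of the two streams."""
    c1 = Counter(invariant_part(p) for p in s1)
    c2 = Counter(invariant_part(p) for p in s2)
    return sum((c1 - c2).values()) + sum((c2 - c1).values())


seen_by_i = [
    Packet("h1", "h2", 64, 0xAAAA, b"one"),
    Packet("h1", "h2", 64, 0xBBBB, b"two"),
    Packet("h1", "h2", 64, 0xCCCC, b"three"),
]
# Node j sees "one" with a decremented TTL, never sees "two", and sees a modified "three".
seen_by_j = [
    Packet("h1", "h2", 63, 0xAAAB, b"one"),
    Packet("h1", "h2", 63, 0xCCCD, b"thr33"),
]
delta = stream_difference(seen_by_i, seen_by_j)
```

Under this reading a dropped packet contributes one to the difference and a modified packet contributes two.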
Network Setting. We consider a network with dynamic traffic patterns and a relatively static
network topology, which is best exemplified by today’s ISP, enterprise, and datacenter networks.
To provide maximum flexibility to support various routing protocols, and even packet-level load
balancing, we impose no restriction on the routing protocols and load balancing mechanisms used
in the network. We assume a trusted administrative controller (AC) in the network, which
shares a pairwise secret key with each node in the network. As we will show later, the AC is
mainly in charge of analyzing the traffic summaries gathered from different nodes and localizing
any neighborhood with data-plane faults. Finally, we require that nodes in the network be loosely
time-synchronized, e.g., on the order of milliseconds. Loose time synchronization is a common
requirement for detecting packet delaying attacks [71, 14, 15], and nowadays even high-precision
clock synchronization is available given the advent of GPS-enabled clocks and the adoption of
IEEE 1588 [46].

7.2.1  Problem  Formulation 

Our goal is to design a practical and secure neighborhood-based fault localization protocol to
identify a suspicious neighborhood (if any) that contains at least one malicious node. Recall that
practicality translates to path obliviousness, volatile path support, and constant router state, as
stated in Section 7.1. We further adopt (α, β, δ)-accuracy [36] to formalize the security
requirements as below:

•  If more than a β fraction of the packets are corrupted by a malicious node m, the fault
localization protocol will raise a neighborhood containing m or one of its colluding nodes as
suspicious with probability at least 1 − δ.

•  In benign cases, if no more than an α fraction of the packets are spontaneously corrupted (e.g.,
dropped) in a neighborhood, the fault localization protocol will raise the neighborhood as
suspicious with probability at most δ.

The thresholds α and β are introduced to tolerate spontaneous failures (e.g., natural packet
loss) and are set by the network administrator based on her experience and expectation of network
performance.
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Neighborhood-based  fault  localization  enables  the  network  administrator  to  scope  further  in¬ 
vestigation  to  a  1-hop  neighborhood  to  find  out  which  router  is  compromised.  It  is  also  possible 
to  further  employ  dedicated  monitoring  protocols,  which  only  need  to  monitor  a  small  region  (the 
identified  neighborhood)  of  the  network  to  find  the  specific  misbehaving  router. 

7.3  Challenges  and  Overview 

In this section, we first describe the high-level steps of a general neighborhood-based fault localization
protocol and then explain the security challenges in the presence of strong adversaries. Finally, we
present the key ideas in DynaFL that address these challenges.

7.3.1 High-Level Steps

The general steps a neighborhood-based fault localization protocol takes are (i) recording local traffic
summaries, (ii) reporting the traffic summaries to the AC, and (iii) detecting suspicious neighborhoods
by the AC based on the received traffic summaries, as we sketch below. Though intuitive, these
general steps face several security vulnerabilities and scalability challenges, as Section 7.3.3 will
show.

Recording. We divide the time in a network into consecutive epochs, which are synchronous
among all the nodes including the AC in the network. For each neighbor r, a node s locally
generates traffic summaries, denoted by $TS_s^{\to r}$ and $TS_s^{\leftarrow r}$, for the packet streams
$\mathbb{S}_{sr}$ and $\mathbb{S}_{rs}$ in each epoch, respectively. Figure 7.3 depicts the router state in a toy example.

Figure 7.3: Router state for traffic summaries.

The traffic summary recorded by a node s should reflect both the packet contents and the
arrival/departure time seen at node s to enable the detection of malicious packet corruption and
delay. For the sake of scalability, the traffic summary cannot simply be an entire copy of all the
original packets (or their hashes using a cryptographic hash function such as SHA-1, which provides
one-wayness and collision resistance) and their timing information. Instead, we use a fingerprinting
function $\mathcal{F}$ to reflect the aggregates of packet contents, which saves both router state and the bandwidth
consumed in reporting the traffic summaries to the AC. We denote the fingerprint for a packet
stream $\mathbb{S}_{rs}$ generated by r as $\mathcal{F}(\mathbb{S}_r^{\to s})$, as Figure 7.3 depicts. In addition, as Figure 7.3 shows, for a
packet stream $\mathbb{S}_{rs}$ (or $\mathbb{S}_{sr}$), the traffic summary of node r also contains the average departure time
$t_r^{\to s}$ (or arrival time $t_r^{\leftarrow s}$) and the total number of packets $n_r^{\to s}$ (or $n_r^{\leftarrow s}$) in $\mathbb{S}_{rs}$ (or $\mathbb{S}_{sr}$) seen in the
current epoch to enable the detection of packet delay attacks.

Reporting.  At  the  end  of  each  epoch,  each  node  s  sends  its  local  traffic  summaries  to  the  AC. 

Detection. After receiving the traffic summaries at the end of an epoch, the AC runs a consistency
check over the traffic summaries in each neighborhood. A large inconsistency of the traffic
summaries in a certain neighborhood N(s) indicates that N(s) is suspicious.

7.3.2 The Fingerprinting Function $\mathcal{F}$

Before we present the instantiation of $\mathcal{F}$, we first describe the general properties that $\mathcal{F}$ should
satisfy. To enable the AC to detect suspicious neighborhoods, $\mathcal{F}$ should generate traffic summaries
with the following two properties:

Property 1. Given any two packet streams $\mathbb{S}$ and $\mathbb{S}'$, the “difference” between $\mathcal{F}(\mathbb{S})$ and $\mathcal{F}(\mathbb{S}')$ can
give an estimation of the difference between $\mathbb{S}$ and $\mathbb{S}'$, denoted by: $\Delta(\mathcal{F}(\mathbb{S}), \mathcal{F}(\mathbb{S}')) \approx \Delta(\mathbb{S}, \mathbb{S}')$.

Defining the “difference” between $\mathcal{F}(\mathbb{S})$ and $\mathcal{F}(\mathbb{S}')$ is $\mathcal{F}$-specific, as we show shortly.

Property 2. Given any two packet streams $\mathbb{S}$ and $\mathbb{S}'$, $\mathcal{F}(\mathbb{S} \cup \mathbb{S}') = \mathcal{F}(\mathbb{S}) \cup \mathcal{F}(\mathbb{S}')$.

The ∪ operator on the left-hand side denotes a union operation of the two packet streams $\mathbb{S}$
and $\mathbb{S}'$. The ∪ operator on the right-hand side denotes a “combination” of $\mathcal{F}(\mathbb{S})$ and $\mathcal{F}(\mathbb{S}')$, which
is $\mathcal{F}$-specific and defined shortly.

These two properties enable the conversion from checking packet stream conservation to checking
the conservation of traffic summaries in a neighborhood. In other words, these two properties enable
nodes to simply store the compact packet fingerprints instead of the original packet streams while
still enabling the AC to detect the number of packets dropped, modified, and fabricated between
two packet streams from their corresponding fingerprints.

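Specifically, during the detection phase, the AC only needs to compare the difference between

(i) the combined traffic summaries for packets sent to node s in N(s), i.e., $\bigcup_{i \in N(s)} \mathcal{F}(\mathbb{S}_i^{\to s})$, and

(ii) the combined traffic summaries for packets received from node s in N(s), i.e., $\bigcup_{i \in N(s)} \mathcal{F}(\mathbb{S}_i^{\leftarrow s})$.

By Properties 1 and 2:

(7.1)
$$
\begin{aligned}
\Delta\Big(\bigcup_{i \in N(s)} \mathcal{F}(\mathbb{S}_i^{\to s}),\ \bigcup_{i \in N(s)} \mathcal{F}(\mathbb{S}_i^{\leftarrow s})\Big)
&= \Delta\Big(\mathcal{F}\big(\bigcup_{i \in N(s)} \mathbb{S}_i^{\to s}\big),\ \mathcal{F}\big(\bigcup_{i \in N(s)} \mathbb{S}_i^{\leftarrow s}\big)\Big) && \text{based on Property 2} \\
&\approx \Delta\Big(\bigcup_{i \in N(s)} \mathbb{S}_i^{\to s},\ \bigcup_{i \in N(s)} \mathbb{S}_i^{\leftarrow s}\Big) && \text{based on Property 1}
\end{aligned}
$$

Note that $\Delta\big(\bigcup_{i \in N(s)} \mathbb{S}_i^{\to s}, \bigcup_{i \in N(s)} \mathbb{S}_i^{\leftarrow s}\big)$ reflects the discrepancy between packets sent to and received
from node s, and a large discrepancy indicates packet dropping, modification, and fabrication
attacks in N(s).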


Sketch for $\mathcal{F}$. The $p^{th}$-moment estimation sketch [9, 30, 88] (as used by Goldberg et al. [36] for
path-based fault localization) serves as a good candidate for $\mathcal{F}$. More specifically, $p^{th}$-moment estimation
schemes use a random linear map to transform a packet stream into a short vector, called
the sketch, as the traffic summary. In benign cases, packets, if viewed as 1.5KB (the Maximum
Transmission Unit) bit-vectors, are “randomly” drawn from $\{0,1\}^{1536 \times 8}$. Hence, different packet
streams will result in different sketches with a very high probability (w.h.p.). Goldberg et al. [36]
also extensively studied how to estimate the number of packets dropped, injected, or modified between
two packet streams from the “difference” of two corresponding sketch vectors, thus satisfying
Property 1. Specifically, the difference $\Delta(\mathcal{F}(\mathbb{S}), \mathcal{F}(\mathbb{S}'))$ (used in Property 1) between two sketch
vectors is defined as:

(7.2)  $\Delta(\mathcal{F}(\mathbb{S}), \mathcal{F}(\mathbb{S}')) = \|\mathcal{F}(\mathbb{S}) - \mathcal{F}(\mathbb{S}')\|_p^p$

where $\|x\|_p^p$ denotes the $p^{th}$ moment of the vector x. We can further prove (see Appendix C) that
the sketch satisfies Property 2 and the combination of $\mathcal{F}(\mathbb{S})$ and $\mathcal{F}(\mathbb{S}')$ used in Property 2 is defined
as:

(7.3)  $\mathcal{F}(\mathbb{S}) \cup \mathcal{F}(\mathbb{S}') = \mathcal{F}(\mathbb{S}) + \mathcal{F}(\mathbb{S}')$

where + denotes the addition of two vectors.
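
To make Properties 1 and 2 concrete, the following is a minimal sketch of a keyed second-moment (AMS-style) linear sketch in Python. It is an illustration under stated assumptions, not the construction evaluated in this chapter: the counter count `SKETCH_LEN`, the use of HMAC-SHA256 to derive the ±1 signs, and all function names are hypothetical choices.

```python
import hashlib
import hmac

SKETCH_LEN = 64  # number of counters; illustrative value, not a tuned parameter


def sketch(packets, key, length=SKETCH_LEN):
    """Map a packet stream (iterable of bytes) to a short vector of counters."""
    vec = [0] * length
    for pkt in packets:
        digest = hmac.new(key, pkt, hashlib.sha256).digest()
        for i in range(length):
            # One keyed pseudo-random bit per counter selects a +1 or -1 update.
            bit = (digest[i // 8] >> (i % 8)) & 1
            vec[i] += 1 if bit else -1
    return vec


def combine(a, b):
    """The 'combination' of two sketches: element-wise vector addition."""
    return [x + y for x, y in zip(a, b)]


def estimate_difference(a, b):
    """Second moment of the sketch difference, averaged over the counters.

    For this +1/-1 construction the result estimates how many packets
    differ between the two underlying streams.
    """
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

Because every packet contributes an independent additive update, the sketch of a union of streams equals the sum of the per-stream sketches, which is what allows the AC to add up the per-neighbor sketches around a node s and compare the inbound total against the outbound total.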

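7.3.3 Challenges in Neighborhood-based Fault Localization

From Property 1, we can further derive the following conditions on the fingerprinting function $\mathcal{F}$.
Given any two packet streams $\mathbb{S}_r$ and $\mathbb{S}_t$ seen at nodes r and t, respectively, a fingerprinting function
computed by r and t should satisfy:

(7.4)  if $\mathbb{S}_r = \mathbb{S}_t$, then $\mathcal{F}(\mathbb{S}_r) = \mathcal{F}(\mathbb{S}_t)$

(7.5)  if $\mathbb{S}_r \neq \mathbb{S}_t$, then $\mathcal{F}(\mathbb{S}_r) \neq \mathcal{F}(\mathbb{S}_t)$ w.h.p.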

The  first  condition  ensures  the  consistency  of  traffic  summaries  (more  precisely,  sketches  in  the 
traffic  summaries)  in  the  benign  case  when  the  packet  streams  are  not  corrupted  between  nodes 
r  and  t.  The  second  condition  ensures  that  if  packet  corruption  happens  between  nodes  r  and  t, 
inconsistency  of  the  traffic  summaries  will  be  observed,  which  will  then  enable  the  estimation  of 
packet difference in the corresponding packet streams (Property 1). However, these two conditions
tend to conflict with each other and lead to the following dilemma.

$\mathcal{F}$ without different secrets. If the random linear map in $\mathcal{F}$ (which can be implemented as a
hash function [21]) is not computed with different secret keys by different nodes, a malicious node
can predict the $\mathcal{F}$ output of any other node for any packet. Since $\mathcal{F}$ maps a set of packets (or
their 160-bit cryptographic hashes) to a much smaller sketch, hash collisions will exist where two
different packets produce the same $\mathcal{F}$ output (since the sketch is not proven to preserve the collision
resistance property of the cryptographic hash function). Hence, a malicious node can leverage such
collisions to modify packets such that the modified/fabricated packets will produce the same $\mathcal{F}$
output at other nodes, violating the condition in (7.5). Figure 7.4 depicts such an example.

Figure 7.4: An example of stealthy packet modification attacks when nodes do not use different
secret keys for computing $\mathcal{F}$. For simplicity, the sketch vector is represented as a ‘0-1’ bit vector.
The malicious node s modifies the packet stream in such a way that the modified packet stream
still results in the same sketch vector as $\mathbb{S}_{rs}$ at node t.

Figure 7.5: Illustration of the difficulty in using different secret keys when computing $\mathcal{F}$. The
sketch vector is represented as a ‘0-1’ bit vector for simplicity. In this example, nodes r, s and t
use different secret keys when computing the sketch to generate their traffic summaries.


$\mathcal{F}$ with different secrets. If nodes compute $\mathcal{F}$ with different secret keys to satisfy the condition
in (7.5), it is hard for the AC to perform a consistency check among the resulting sketches. For
example, even the same packet stream would result in different sketches at different nodes, thus
violating the condition in (7.4). Figure 7.5 depicts such an example. Since the sketch is only a
compact and approximate representation of the original packet stream, the AC cannot revert the
received sketches to the original packet streams to check packet stream conservation.

Scalability vs. sampling. Even with $\mathcal{F}$ for packet fingerprinting, a traffic summary over a
huge number of packets can become too bandwidth-consuming to be sent frequently to the AC
(e.g., every 20 milliseconds). For example, the number of packets for an OC-192 link (10 Gbps)
can be on the order of $10^7$ per second in the worst case, which swells the size of a sketch to
hundreds of bytes to bound the false positive rate below 0.001 [36] and may require several KB/s
bandwidth for the reporting by each node. Packet sampling represents a popular approach to
reducing bandwidth consumption, where each node only samples a subset of packets to feed into
$\mathcal{F}$ for generating the traffic summaries. To enable a consistency check of the traffic summaries in
a neighborhood, all nodes in a neighborhood should sample the same subset of packets, and the
challenge is how to efficiently decide which subset of packets all nodes should agree to sample. For
security, the sampling scheme must ensure that a malicious node cannot predict whether a packet
to be forwarded will be sampled or not. Otherwise, the malicious node can drop any non-sampled
packets without being detected.

The  problem  is  further  complicated  by  the  presence  of  collusion  attacks  in  our  strong  adversary 
model  as  well  as  our  path  obliviousness  requirement.  Several  existing  sampling  schemes  are  broken 
when  applied  to  our  setting.  For  example,  in  Symmetric  Secure  Sampling  (SSS)  [36],  the  packet 
sender  and  receiver  use  a  shared  Pseudo-Random  Function  (PRF)  V  to  coordinate  their  sampling. 
Imported  to  our  setting,  e.g.,  using  the  neighborhood  example  in  Figure  7.5,  nodes  r  and  t  share 
a  secret  key  Krt  and  a  PRF  V,  compute  V  with  Krt  for  each  packet,  and  sample  the  packet  if  the 
PRF  output  is  within  a  certain  range.  In  this  way,  node  s  itself  cannot  know  whether  a  packet  is 
sampled  or  not.  However,  this  approach  fails  in  our  setting.  Take  the  topology  in  Figure  7.5  for 
example: 

•  If  s  and  r  collude,  r  can  inform  s  of  which  packets  are  sampled,  so  that  s  can  safely  drop 
non-sampled  packets  and  not  be  detected. 

• Due to the dynamic traffic pattern, an ingress node r of a neighborhood N(s) does not know
which egress node a packet will traverse in N(s) (if s has more neighbors than r and t, there
exist multiple possible egress nodes other than t). Hence, r does not know which PRF or secret key
to use for packet sampling, given that r shares a different secret key with each node in N(s).

7.3.4 DynaFL Key Ideas

In DynaFL, nodes temporarily store the cryptographic hashes (which are collision-resistant) for
all packets received/sent per neighbor in an epoch. At the end of each epoch e, nodes use epoch
sampling to decide if packets in the epoch are to be fingerprinted; if so, nodes generate the traffic
summaries and report them to the AC. This reduces both the communication overhead for sending
the traffic summaries to the AC and the computational overhead for generating and checking the
traffic summaries. Specifically, nodes first use the same per-epoch sampling key $K_s^e$ (described
shortly) for computing a PRF V to determine if the current epoch is “selected”; if and only if
the current epoch is selected, nodes will use $\mathcal{F}$ with the same per-epoch fingerprinting key $K_f^e$
(described shortly) to map packets into per-neighbor traffic summaries. Using the same $K_s^e$ and
$K_f^e$ enables consistency checking over the traffic summaries from different nodes.

To address the packet modification attacks and collusion attacks mentioned earlier, nodes do
not know the per-epoch $K_s^e$ and $K_f^e$ until the end of each epoch e, after they have forwarded (or
possibly corrupted) packets in epoch e. Thus, when a packet is to be forwarded (or corrupted), a
malicious node does not know $K_s^e$ and $K_f^e$, and thus cannot predict whether this epoch is selected
for sending traffic summaries, and if selected, what the sketch output will be for this packet. To
achieve this property, in DynaFL, the trusted AC periodically sends the per-epoch $K_s^e$ and $K_f^e$ via
function disclosure messages to all nodes at the end of each epoch in a reliable way (described
later) and nodes use the received $K_s^e$ and $K_f^e$ to select epochs and fingerprint packets that have
already been forwarded or corrupted.

A malicious node may first attempt to locally hold all the packets in an epoch e, and only
forward or corrupt packets at the end of e when the malicious node learns $K_s^e$ and $K_f^e$, thus being
able to launch the sophisticated packet modification and selective packet corruption attacks
mentioned earlier. However, since the traffic summaries also include the average departure/arrival
time of the sent/received packets, the malicious node will be detected with packet delay misbehavior
in the detection phase.

Sections 7.4, 7.5, and 7.6 detail the recording, reporting, and detection phases in DynaFL,
respectively. Section 7.7 presents the security analysis and Section 7.8 evaluates DynaFL’s performance
through measurements and simulations.

7.4 Recording Traffic Summaries

Nodes in DynaFL generate their traffic summaries by first temporarily storing all the received and
sent packets for each epoch along with aggregate timing information. Then upon receiving the
keys disclosed by the AC, nodes determine if the current epoch is selected with a keyed PRF V,
and if so, fingerprint the cached packets with keyed $\mathcal{F}$. The technical challenges in the recording
phase are how to deal with imperfect time synchronization among nodes and packet transmission
delay, and how to efficiently protect the function disclosure message from adversarial corruption.
We explain how DynaFL solves these challenges in turn below.

Definition 20. The epoch IDs are labeled as 0, 1, 2, .... If the current network time is t, then the
current epoch ID is $\lfloor t / L \rfloor$, where L is the epoch length.

7.4.1  Storing  Packets 

In  the  “ideal”  case  (with  perfect  time  synchronization  and  no  packet  transmission  delay),  nodes 
simply  need  to  store  packets  for  the  single  “current”  epoch  and  at  the  end  of  each  epoch  send  the 
traffic  summaries  to  the  AC  for  that  epoch.  However,  in  practice,  routers  need  to  determine  which 
epoch  an  incoming  packet  belongs  to  (or  whether  a  received  packet  belongs  to  the  current  epoch  or 
a  previous,  outdated  epoch).  One  might  attempt  to  let  routers  map  received  packets  into  epochs 
based  on  their  local  packet  arrival  time.  However,  this  approach  would  introduce  large  errors  for 
the  following  reasons: 

• Though all the nodes in the network are loosely time-synchronized, e.g., within ±1 millisecond, the
epoch intervals at different nodes may still be misaligned by up to a few milliseconds. This
misalignment will result in a considerable number of packets being attributed to different
epochs at different nodes, thus causing inconsistencies in the corresponding packet fingerprints.

• Due to the network transmission delay, a packet sent by a source at epoch e may arrive at
another node at a different epoch e + i. In other words, a packet may have been received
by an ingress node but not the egress node of a neighborhood at the end of an epoch when
nodes need to generate their packet fingerprints, thus producing inconsistencies in the traffic
summaries.

To deal with imperfect time synchronization, the source in DynaFL embeds a local timestamp
when sending each packet. Such a timestamp can be added as an additional flow header, using the
TCP timestamp, or in the IP option field that all routers can process efficiently. Any router in the
forwarding path will determine the corresponding epoch for each packet based on the embedded
timestamp. In this way, we ensure that all routers put each packet in the same epoch for updating
the traffic summaries. For example, if the timestamp embedded by the source is $t_s$ and the epoch
length is L, then all routers will map the packet into epoch $\lfloor t_s / L \rfloor$.

To eliminate traffic summary inconsistencies due to packet transmission delay, we also need to
ensure that when generating traffic summaries for a certain epoch e, packets that are sent and not
corrupted in epoch e are received by all the nodes in the forwarding paths. To this end, if the
epoch length is set to L and the expected upper bound on the one-way packet transmission delay
in the network is D, each router stores packets sent in the current epoch e as well as in the previous
$\lceil D/L \rceil$ epochs, denoted by $e - 1, e - 2, \ldots, e - \lceil D/L \rceil$. We call these epochs live epochs. Then at
the end of an epoch e, nodes will generate and send to the AC the traffic summaries for the oldest
live epoch $e - \lceil D/L \rceil$, in which the packets have either traversed all nodes in their forwarding paths
or been corrupted. The periodic function disclosure messages that the AC sends synchronize the
current epoch ID and the oldest live epoch ID for which traffic summaries are needed for reporting.
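
The epoch bookkeeping described above can be illustrated with a short Python sketch. The function names and the use of integer milliseconds are assumptions made for the example; clamping the oldest live epoch at 0 during the first few epochs is also an assumption, since the text does not discuss start-up.

```python
def epoch_of(timestamp_ms, epoch_len_ms):
    """Epoch ID floor(t_s / L) for a source-embedded timestamp t_s (non-negative ints)."""
    return timestamp_ms // epoch_len_ms


def live_epochs(current_epoch, epoch_len_ms, max_delay_ms):
    """Return the live epochs e - ceil(D/L), ..., e - 1, e in increasing order."""
    lag = -(-max_delay_ms // epoch_len_ms)  # integer ceil(D / L)
    oldest = max(current_epoch - lag, 0)    # assumption: clamp at epoch 0 during start-up
    return list(range(oldest, current_epoch + 1))


def oldest_live_epoch(current_epoch, epoch_len_ms, max_delay_ms):
    """The epoch whose traffic summaries are reported at the end of the current epoch."""
    return live_epochs(current_epoch, epoch_len_ms, max_delay_ms)[0]
```

For instance, with L = 20 ms and D = 20 ms, a router keeps state for two epochs at a time: the current one and the one before it, which is the epoch it reports on when the current epoch ends.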

Hence,  a  node  s  maintains  the  following  data  structures  for  each  neighbor  r  for  each  epoch,  as 
Figure  7.6  also  shows. 

• The packet cache $C_s^{\leftrightarrow r}$ temporarily stores hashes for packets in both $\mathbb{S}_s^{\to r}$ and $\mathbb{S}_s^{\leftarrow r}$ that are
seen in a live epoch (using a cryptographic hash function such as SHA-1). Each entry contains
the packet hash and a bit indicating if the packet belongs to $\mathbb{S}_s^{\to r}$ or $\mathbb{S}_s^{\leftarrow r}$.

• The router stores the sum of packet departure timestamps $t_s^{\to r}$ seen in $\mathbb{S}_s^{\to r}$ and the sum of
packet arrival timestamps $t_s^{\leftarrow r}$ seen in $\mathbb{S}_s^{\leftarrow r}$ in a live epoch with millisecond precision.

• Finally, the router stores the total number of packets $n_s^{\to r}$ seen in $\mathbb{S}_s^{\to r}$ and $n_s^{\leftarrow r}$ seen in $\mathbb{S}_s^{\leftarrow r}$
in a live epoch.

Figure 7.6: Router per-neighbor state details.

Among these data structures, $t_s^{\leftarrow r}$, $t_s^{\to r}$, $n_s^{\leftarrow r}$, and $n_s^{\to r}$ require small constant storage, around
8 or 4 bytes for each. $C_s^{\leftrightarrow r}$ will be used for packet fingerprinting. The size of $C_s^{\leftrightarrow r}$ depends only
on the epoch length L and link bandwidth, but not the number of flows/paths traversing node s.
As Section 7.8.1 shows, with an epoch length of 20 milliseconds and one-way network latency of 20
milliseconds, each router line-card requires only around 4MB of memory for an OC-192 link, which
is readily available today.

For simplicity’s sake, we use $C_s^{\to r}$ and $C_s^{\leftarrow r}$ to denote the packets cached for $\mathbb{S}_s^{\to r}$ and $\mathbb{S}_s^{\leftarrow r}$ by
node s, respectively.
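
As a rough illustration of the per-neighbor, per-epoch state listed above, the following Python sketch keeps a hash cache with a direction bit, two timestamp sums, and two packet counters. The class and field names are hypothetical, and a real line card would use a fixed-size memory layout rather than a Python list.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class NeighborEpochState:
    """State a node keeps for one neighbor in one live epoch (illustrative)."""
    cache: list = field(default_factory=list)  # entries: (SHA-1 digest, outgoing bit)
    sent_time_sum_ms: int = 0                  # sum of departure timestamps
    recv_time_sum_ms: int = 0                  # sum of arrival timestamps
    sent_count: int = 0
    recv_count: int = 0

    def record(self, packet: bytes, time_ms: int, outgoing: bool) -> None:
        """Cache the packet hash and update the timing aggregate for its direction."""
        self.cache.append((hashlib.sha1(packet).digest(), outgoing))
        if outgoing:
            self.sent_time_sum_ms += time_ms
            self.sent_count += 1
        else:
            self.recv_time_sum_ms += time_ms
            self.recv_count += 1

    def average_times(self):
        """Return (average departure time, average arrival time); None if no packets."""
        dep = self.sent_time_sum_ms / self.sent_count if self.sent_count else None
        arr = self.recv_time_sum_ms / self.recv_count if self.recv_count else None
        return dep, arr
```

Only the cache grows with the number of packets in an epoch; the sums and counters stay constant-size regardless of how many flows traverse the node.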


7.4.2  Secure  Function  Disclosure 


At the end of each epoch e, the AC discloses the sampling key $K_s^{e - \lceil D/L \rceil}$ and fingerprinting key
$K_f^{e - \lceil D/L \rceil}$ to all nodes in the network via a function disclosure message $d_{AC}$, and requests the traffic
summaries for the oldest live epoch $e - \lceil D/L \rceil$. Obviously, $d_{AC}$ itself needs to be protected from
data-plane attacks (dropping, modification, fabrication, or delaying) by a malicious node during
end-of-epoch broadcasting. It might be tempting to let the AC use digital signatures to authenticate
$d_{AC}$ in order to address malicious modification and fabrication; however, frequently generating and
verifying the signatures on a per-epoch basis can be expensive (e.g., an epoch can be as short as 20
milliseconds and signature generation and verification time could be on the order of milliseconds).

Fortunately, the function disclosure message $d_{AC}$ is transmitted at the end of each epoch synchronously
among all the nodes. If a malicious node s drops $d_{AC}$, the AC will fail to receive the
traffic summaries of certain neighbors of s, thus detecting N(s) as suspicious. For example, in
Figure 7.7, if s drops $d_{AC}$ instead of forwarding it to its neighbor r, node r cannot fingerprint
the packets to generate traffic summaries, thus failing the consistency check of traffic summaries
in N(s). As we show in Section 7.5, the AC expects to receive traffic summaries within a short
amount of time after each epoch ends; delaying $d_{AC}$ more than that amount of time is effectively
equivalent to dropping $d_{AC}$ and causes the malicious node’s neighborhood to be detected. Thus,
the remaining problem is to prevent the modification and fabrication of $d_{AC}$, which is equivalent
to authenticating $d_{AC}$ to all nodes in the network without the use of digital signatures. Section 7.7
further elaborates why the authentication of $d_{AC}$ is needed for security purposes.

Figure 7.7: Possible attacks in the recording phase. A malicious node s may attempt to drop the
function disclosure message $d_{AC}$, or manipulate the TTL value to cause packets to be dropped at a
remote place (node a in this example), thus framing a remote neighborhood (N(a) in this example).

In DynaFL, time in the network is loosely synchronized and divided into consecutive
epochs; the authentication of $d_{AC}$ is required only once per epoch. This setting is naturally aligned
with that of the TESLA broadcast authentication [77], which authenticates broadcast messages
($d_{AC}$ in our case) using only Message Authentication Codes (MACs) with keys derived from a one-way
hash chain. As Figure 7.8 shows, the AC applies a one-way hash function H repeatedly on the
root key $K_r$ to derive a set of epoch authentication keys, and uses key $K_e$ to compute a MAC for
authenticating $d_{AC}$ in epoch e. The AC publishes $K_0$ through the network so that nodes can verify
if any given epoch key is indeed derived from the genuine one-way hash chain. Then $d_{AC}$ in epoch e
includes (i) the current epoch ID e, the oldest live epoch ID $j = e - \lceil D/L \rceil$ to be examined, sampling
and fingerprinting keys, a MAC computed with $K_e$ for the current epoch, and (ii) the key $K_j$ for
computing the MAC in a previous epoch j, by which nodes can verify the authenticity of $d_{AC}$ in
epoch j (verification delayed by $\lceil D/L \rceil$ epochs). That is:

(7.6)  $d_{AC} = e \,\|\, j \,\|\, K_s^j \,\|\, K_f^j \,\|\, MAC_{K_e}(e \,\|\, j \,\|\, K_s^j \,\|\, K_f^j) \,\|\, K_j$

where $\|$ denotes concatenation. Section 7.7 describes the reason for disclosing the key for epoch
$j = e - \lceil D/L \rceil$ instead of epoch $e - 1$.

Figure 7.8: One-way hash chain example: $K_0 \leftarrow K_1 \leftarrow \cdots \leftarrow K_{r-1} \leftarrow K_r$, where each arrow applies H,
i.e., $K_{i-1} = H(K_i)$.
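
The delayed-key authentication of the function disclosure message can be sketched in Python as follows. This is an illustration only: SHA-256 for H, HMAC-SHA256 for the MAC, the 8-byte big-endian encoding of epoch IDs, and all function names are assumptions that the text does not prescribe.

```python
import hashlib
import hmac


def H(key: bytes) -> bytes:
    """One-way hash function used to build the chain (assumed: SHA-256)."""
    return hashlib.sha256(key).digest()


def make_chain(root_key: bytes, length: int) -> list:
    """Return [K_0, ..., K_r] with K_r = root_key and K_{i-1} = H(K_i)."""
    chain = [root_key]
    for _ in range(length):
        chain.append(H(chain[-1]))
    chain.reverse()
    return chain


def chain_key_is_genuine(k0: bytes, key: bytes, epoch: int) -> bool:
    """Check that hashing the claimed epoch key `epoch` times yields the published K_0."""
    for _ in range(epoch):
        key = H(key)
    return hmac.compare_digest(key, k0)


def build_disclosure(e: int, j: int, ks_j: bytes, kf_j: bytes, chain: list):
    """AC side: payload e || j || K_s^j || K_f^j, its MAC under K_e, and the disclosed K_j."""
    payload = e.to_bytes(8, "big") + j.to_bytes(8, "big") + ks_j + kf_j
    mac = hmac.new(chain[e], payload, hashlib.sha256).digest()
    return payload, mac, chain[j]


def verify_stored_disclosure(k0, payload, mac, disclosed_key, epoch) -> bool:
    """Node side: verify a message stored from `epoch` once its key has been disclosed."""
    if not chain_key_is_genuine(k0, disclosed_key, epoch):
        return False
    expected = hmac.new(disclosed_key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, mac)
```

A node buffers the message it received in epoch j and runs `verify_stored_disclosure` when a later message discloses the key for epoch j; until then it cannot check the MAC, which is the verification delay mentioned above.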

Furthermore, DynaFL creates a spanning tree in the network rooted at the AC, along which
$d_{AC}$ is delivered to each node. Since DynaFL uses a pre-generated, static spanning tree for the
broadcast messages, there is no need for dynamic path support when protecting $d_{AC}$.

7.4.3  Sampling  and  Fingerprinting 

Given the disclosed $K_s^j$ and $K_f^j$ at the end of an epoch e, each node t first uses the sampling PRF
V with $K_s^j$, denoted by $V_{K_s^j}$, to determine if the oldest live epoch j is selected. If so, node t then
uses the fingerprinting function $\mathcal{F}$ to map the cached packet hashes in each per-neighbor stream
into a sketch vector, i.e., $\mathcal{F}_{K_f^j}(C_t^{\to r})$ or $\mathcal{F}_{K_f^j}(C_t^{\leftarrow r})$, computed with the given $K_f^j$. Finally, node t
generates two traffic summaries $T_t^{\to r}$ and $T_t^{\leftarrow r}$ for a neighbor r:

• $T_t^{\to r}$ for packet stream $\mathbb{S}_t^{\to r}$ includes a fingerprint $\mathcal{F}_{K_f^j}(C_t^{\to r})$, the average packet departure time
(the timestamp sum $t_t^{\to r}$ divided by $n_t^{\to r}$), and the total number $n_t^{\to r}$ of packets seen in $\mathbb{S}_t^{\to r}$ in epoch j;

• $T_t^{\leftarrow r}$ for packet stream $\mathbb{S}_t^{\leftarrow r}$ includes a fingerprint $\mathcal{F}_{K_f^j}(C_t^{\leftarrow r})$, the average packet arrival time
(the timestamp sum $t_t^{\leftarrow r}$ divided by $n_t^{\leftarrow r}$), and the total number $n_t^{\leftarrow r}$ of packets seen in $\mathbb{S}_t^{\leftarrow r}$ in epoch j.

Figure 7.9 summarizes the fault localization-related packet processing inside a DynaFL router.
We detail V and $\mathcal{F}$ in the following.

Figure 7.9: Fault localization-related packet processing inside a DynaFL router (per-packet operations
and per-epoch operations).

Implementing V. Specifically, V maps an epoch ID to an n-bit integer uniformly distributed in
$[0, 2^n - 1]$. Given a sampling rate λ ∈ (0, 1), a node computes $V_{K_s^j}$ over the epoch ID j that is
being examined, and epoch j is selected iff:

(7.7)  $V_{K_s^j}(j) < \lambda \cdot 2^n$

In this way, on average a fraction λ of the epochs will be selected. Since nodes use V with the same
$K_s^j$ for epoch sampling, in the benign case, nodes will select the same set of epochs, thus ensuring the
consistency of the traffic summaries in a neighborhood.
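
The epoch selection rule can be sketched in a few lines of Python. HMAC-SHA256 truncated to n bits stands in for the PRF here; that choice, the default of n = 32, and the function names are assumptions for illustration rather than the instantiation used in DynaFL.

```python
import hashlib
import hmac

N_BITS = 32  # output width n of the PRF; illustrative value


def prf(key: bytes, epoch_id: int, n_bits: int = N_BITS) -> int:
    """Keyed PRF mapping an epoch ID to an n-bit integer (assumed: HMAC-SHA256, truncated)."""
    digest = hmac.new(key, epoch_id.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") >> (256 - n_bits)


def epoch_selected(key: bytes, epoch_id: int, rate: float, n_bits: int = N_BITS) -> bool:
    """Select the epoch iff the PRF output is below rate * 2^n."""
    return prf(key, epoch_id, n_bits) < rate * (1 << n_bits)
```

Every node that holds the same disclosed sampling key evaluates the same deterministic test, so benign nodes agree on the selected epochs without exchanging any extra messages.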


Implementing $\mathcal{F}$. We use the second-moment sketch computed with $K_f^j$ as a case study to
implement $\mathcal{F}$, and analyze the size of the sketch vector to achieve Property 1 with (α, β, δ)-accuracy.
We assume $10^7$ packets per second in the worst case for an OC-192 link with an epoch
length of L (seconds). Then, the number of packets η in a sampled epoch is $\eta = L \cdot 10^7$. Using the
classical sketch due to Alon et al. [11] for example, the storage requirement for the sketch is given
by:

(7.8)  $M \times \log_2 \sqrt{2\eta \ln\left(\frac{200M}{\delta}\right)}$

(7.9)  where $M > \frac{12}{\epsilon^2} \cdot \frac{1}{3 - 2\epsilon} \ln \frac{1}{\delta}$ and $\epsilon = \frac{\beta - \alpha}{\beta + \alpha}$

In Section 7.8.1 we derive numeric values for the size of the sketch vector based on the epoch length
L.

Dealing with TTL attacks. Certain fields in the IP header, such as the TTL, checksum, and
some IP option fields, will change at each hop. Both sampling and fingerprinting in DynaFL need
to properly deal with these variant fields to avoid both false positives and false negatives. We take the
TTL field as an example below (though the arguments apply similarly to other variant fields).
On the one hand, if V and $\mathcal{F}$ are computed over the entire packets including the TTL field, even
in the benign case the same packet stream will leave different traffic summaries (or precisely, the
sketch vectors) at ingress and egress nodes. On the other hand, if V and $\mathcal{F}$ are computed over
the entire packets excluding the TTL field, a malicious node can modify the TTL field at liberty
without affecting the traffic summaries. Figure 7.7 depicts an example TTL attack, where the
malicious node s lowers the TTL value to 2 in the packets and causes the packets to be dropped
at the 2-hop-away downstream node a, thus framing neighborhood N(a).

Figure 7.10: Example of secure transmission of traffic summary reports. For brevity, we denote the
traffic summaries of a node i as $T_i$ and omit the secret key for the MAC notation.

To address the TTL attacks, when computing V and F, each node r does one of the following, depending on the direction of the packet:

•  For  a  packet  received  from  a  neighbor,  node  r  computes  V  and  F  over  the  entire  packet 
including  the  TTL  field. 

•  For  a  packet  sent  to  a  neighbor,  node  r  computes  V  and  F  over  the  packet,  but  with  the  TTL 
field  additionally  decreased  by  2  (equal  to  the  TTL  value  at  the  2-hop-away  egress  node  in 
N(r)). 

In this way, node r in Figure 7.7 simply uses the TTL value as contained in the packets received from s when computing V and F, since the ingress nodes in N(s) (nodes i and j) must have computed V and F with an adjusted TTL value equal to that at node r.
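
The two cases above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the packet is a raw IPv4 packet held as bytes (the TTL is the byte at offset 8 of the IPv4 header), the function name `summary_input` and its `direction` argument are hypothetical, and other variant fields such as the header checksum are not handled here.

```python
TTL_OFFSET = 8  # the TTL is the 9th byte of an IPv4 header


def summary_input(packet: bytes, direction: str) -> bytes:
    """Return the bytes over which V and F would be computed (illustrative only)."""
    if direction == "received":
        # Packets received from a neighbor are used as-is, TTL included.
        return packet
    if direction == "sent":
        # Packets sent to a neighbor are summarized with the TTL lowered by 2,
        # as the text prescribes, to match the 2-hop-away egress node.
        ttl = packet[TTL_OFFSET]
        if ttl < 2:
            raise ValueError("TTL too small to adjust")
        adjusted = bytearray(packet)
        adjusted[TTL_OFFSET] = ttl - 2
        return bytes(adjusted)
    raise ValueError("direction must be 'sent' or 'received'")
```

Because the adjustment is applied only to a copy used for the summary computation, the packet that is actually forwarded is left untouched in this sketch.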

The TTL value in a packet is also decremented by one for every second the packet is buffered at a router. Holding a packet for longer than one second at a router is treated as a packet delaying attack and will be detected thanks to the above construction.

7.5 Reporting Traffic Summaries

If an epoch is selected, after fingerprinting, a node t generates two traffic summaries $T_t^{\rightarrow r}$ and $T_t^{\leftarrow r}$ for each neighbor r, and sends them to the AC in a traffic summary report denoted by $\mathcal{R}_t$. The challenge in the recording phase is to protect the traffic summary reports from being corrupted.

In DynaFL, nodes form a static spanning tree rooted at the AC for sending the traffic summaries. Given the spanning tree, the goal is to protect the traffic summary reports $\mathcal{R}_t$ sent from different nodes to the AC. Although the reports are also subject to data-plane attacks, they are transmitted over static and pre-generated paths in the spanning tree. Hence, dynamic traffic is no longer a concern, which substantially simplifies the problem. Specifically, DynaFL utilizes an Onion Authentication approach to protect the transmission of dAC along each path in the spanning tree. In a nutshell, within a short time window at the end of each epoch, each node t needs to send its traffic summary report $\mathcal{R}_t$ to the AC, and $\mathcal{R}_t$ is authenticated with a MAC computed using a pairwise secret key shared between node t and the AC. The traffic summary reports from different nodes are sent in an onion fashion. For example, in Figure 7.10, $\mathcal{R}_j$ includes the report $\mathcal{R}_k$ of node k. In this way, DynaFL efficiently protects dAC without the use of expensive asymmetric cryptography. Section 7.7 gives a more detailed security analysis of the Onion Authentication approach.
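
The onion-style nesting of reports can be sketched roughly as below. This is an illustration under assumptions rather than the exact message format of Figure 7.10: HMAC-SHA256 stands in for the MAC, the dictionary layout and the names `wrap_report` and `verify_report` are hypothetical, and a child report is bound into its parent's MAC through the child's tag.

```python
import hashlib
import hmac


def mac(key: bytes, data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()


def _body(node_id: str, summary: bytes, children: list) -> bytes:
    # Bytes covered by a node's MAC: its id, its summary, and its children's tags.
    return node_id.encode() + summary + b"".join(c["tag"] for c in children)


def wrap_report(node_id: str, key: bytes, summary: bytes, children: list) -> dict:
    """Node side: wrap the node's own summary around the reports of its children."""
    return {"id": node_id, "summary": summary, "children": children,
            "tag": mac(key, _body(node_id, summary, children))}


def verify_report(report: dict, keys: dict) -> bool:
    """AC side: check every layer with the pairwise key shared with that node."""
    if not all(verify_report(c, keys) for c in report["children"]):
        return False
    expected = mac(keys[report["id"]],
                   _body(report["id"], report["summary"], report["children"]))
    return hmac.compare_digest(report["tag"], expected)
```

In this sketch, a node on the path that lacks the keys of j and k can neither alter node k's summary nor strip node k's report out of node j's report without the AC noticing.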

7.6  Detection 

The AC performs consistency checks for each neighborhood N(r) based on the received traffic summaries. However, since an epoch may only have a small number of packets, detecting a suspicious
neighborhood  based  on  the  consistency  checks  for  individual  epochs  can  introduce  a  large  error 
rate.  Take  an  extreme  case  for  example:  if  in  a  certain  epoch  a  neighborhood  N(r)  only  transmits 
a  single  packet  and  the  packet  was  spontaneously  lost,  concluding  that  the  packet  loss  rate  is  100% 
and  N(r)  is  suspicious  would  be  inaccurate. 

To deal with this problem, we still perform the consistency checks and estimate the discrepancy for individual epochs, but base the detection on the aggregated discrepancies over a set of E epochs (called accumulated epochs), chosen so that the total number of packets over the E epochs exceeds a certain threshold N that gives a high enough accuracy (e.g., > 99.9%) for the detection results. Section 7.8 studies the value of N. The AC therefore stores the traffic summaries for each neighborhood and performs detection once the total number of packets reaches N. More specifically, let $n_i^{\rightarrow r}(e)$ and $n_i^{\leftarrow r}(e)$ denote the numbers of packets that node i reports as sent to r and received from r, respectively, in its traffic summaries for epoch e. For a certain neighborhood N(r), whenever

$$\max\Big\{\sum_e \sum_i n_i^{\rightarrow r}(e),\ \sum_e \sum_i n_i^{\leftarrow r}(e)\Big\} > N \tag{7.10}$$

(where $i \in N(r)$ and e iterates over all the accumulated epochs), indicating that N is reached, the AC performs the following checks to inspect whether N(r) is suspicious:


1. Flow conservation. The AC first extracts $n_i^{\rightarrow r}(e)$ and $n_i^{\leftarrow r}(e)$ for each node i in N(r) for each epoch e, and calculates the difference between the number of packets sent to r and the number of packets received from r over all the E accumulated epochs. If the ratio of the difference to the total number of packets in all the E accumulated epochs is larger than a threshold β, i.e.:

$$\frac{\left|\sum_e \sum_i n_i^{\rightarrow r}(e) - \sum_e \sum_i n_i^{\leftarrow r}(e)\right|}{\max\left\{\sum_e \sum_i n_i^{\rightarrow r}(e),\ \sum_e \sum_i n_i^{\leftarrow r}(e)\right\}} > \beta \tag{7.11}$$

then the AC detects N(r) as suspicious. The threshold β is set based on the administrator’s expectation of the natural packet loss rate; e.g., in the simulations in Section 7.8 we set β to four times the natural packet loss rate in a neighborhood.
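
The accumulation condition of Equation 7.10 and the flow conservation check of Equation 7.11 can be sketched as follows. This is a minimal illustration; the function name, the list-of-lists input layout (one inner list of per-node counts per accumulated epoch), and the sample numbers are assumptions made for the example.

```python
def flow_conservation_suspicious(sent_to_r, recv_from_r, beta, n_threshold):
    """Return None while too few packets are accumulated, else True/False.

    sent_to_r[e][i]   : packets node i reports as sent to r in epoch e
    recv_from_r[e][i] : packets node i reports as received from r in epoch e
    """
    total_sent = sum(sum(epoch) for epoch in sent_to_r)
    total_recv = sum(sum(epoch) for epoch in recv_from_r)
    largest = max(total_sent, total_recv)
    if largest <= n_threshold:
        # Equation 7.10 not yet satisfied: keep accumulating epochs.
        return None
    # Equation 7.11: relative difference compared against beta.
    return abs(total_sent - total_recv) / largest > beta


sent = [[3000, 2500], [3000, 2500]]       # 11000 packets sent to r
benign = [[2997, 2498], [2996, 2498]]     # about 0.1% natural loss
attacked = [[2950, 2480], [2960, 2470]]   # noticeably more loss

print(flow_conservation_suspicious(sent, benign, beta=0.004, n_threshold=5000))
print(flow_conservation_suspicious(sent, attacked, beta=0.004, n_threshold=5000))
```

With β set to four times a natural loss rate of 0.001, the benign counts stay below the threshold while the second set of counts exceeds it.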


2. Content conservation. The AC then extracts the sketches in the traffic summaries in N(r), and estimates the discrepancy $\delta_t$ between the sketches for packets sent to r and the sketches for packets received from r. The AC detects N(r) as malicious if $\delta_t$ is larger than a certain threshold, i.e.:

$$\delta_t > \frac{2\alpha\beta}{\alpha + \beta} \times \max\Big\{\sum_e \sum_i n_i^{\rightarrow r}(e),\ \sum_e \sum_i n_i^{\leftarrow r}(e)\Big\} \tag{7.12}$$

where

It has been proven [36] that the above threshold can satisfy the (α, β, δ)-accuracy defined in Section 7.2.1.
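
To make the content conservation check more concrete, the sketch below builds a small AMS-style sketch vector and compares two of them. It is an illustration under assumptions, not the dissertation's C++ implementation: keyed HMAC-SHA256 bits stand in for the four-wise hash function, the names `sketch`, `discrepancy`, and `content_suspicious` are hypothetical, and the threshold follows Equation 7.12 read as $2\alpha\beta/(\alpha+\beta)$ times the larger packet count.

```python
import hashlib
import hmac


def sketch(packets, key: bytes, m: int = 64):
    """Sum a keyed +1/-1 value per packet into each of m counters (m <= 256)."""
    counters = [0] * m
    for pkt in packets:
        digest = hmac.new(key, pkt, hashlib.sha256).digest()
        for j in range(m):
            bit = (digest[j // 8] >> (j % 8)) & 1
            counters[j] += 1 if bit else -1
    return counters


def discrepancy(sketch_a, sketch_b):
    """Estimate how many packets differ: mean squared counter difference."""
    return sum((a - b) ** 2 for a, b in zip(sketch_a, sketch_b)) / len(sketch_a)


def content_suspicious(disc, total_sent, total_recv, alpha, beta):
    return disc > (2 * alpha * beta / (alpha + beta)) * max(total_sent, total_recv)


key = b"fingerprinting-key"  # placeholder for the key disclosed by the AC
sent = [i.to_bytes(4, "big") for i in range(1000)]
received = sent[:950]        # 50 packets never arrive

d = discrepancy(sketch(sent, key), sketch(received, key))
print(content_suspicious(d, len(sent), len(received), alpha=0.002, beta=0.004))
```

Since the sketch is a sum over packets, two nodes that saw the same packets produce identical vectors regardless of packet order, and the difference of two vectors depends only on the packets that differ.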

3.  Timing  consistency.  Finally,  the  AC  extracts  the  difference  between  the  average  packet 
departure  time  and  arrival  time,  and  concludes  that  N(r)  is  suspicious  if  the  difference  is  larger 
than  the  expected  upper  bound  on  the  2-hop  link  latency. 

7.7  Security  Analysis 

We  show  that  DynaFL  is  secure  against  all  attacks  that  are  possible  in  the  misbehavior  space  given 
our  adversary  model.  By  our  definition,  a  malicious  router  can  drop,  modify,  fabricate,  and  delay 
packets.  In  addition,  a  malicious  router  can  attack  data  packets,  function  disclosure  messages  dAC, 
and  reporting  messages.  We  first  show  DynaFL’s  security  against  a  single  malicious  node  and  then 
sketch  DynaFL’s  security  against  colluding  nodes. 

Security against corrupting the data packets. Dropping, modifying, and fabricating data packets in a neighborhood N(m) will cause inconsistencies between sketches in N(m), as mentioned earlier. Delaying data packets in N(m) will cause an abnormal deviation between the average packet arrival and departure timestamps in N(m). If a malicious router changes the timestamps that the source nodes embed in data packets, this is equivalent to modifying the packets, and the packets may be mapped to different epochs; such an attack will therefore manifest itself as inconsistencies in the sketches of a neighborhood containing the malicious router.

Security against corrupting dAC. As we mentioned earlier, if a malicious node m drops the dAC, some nodes adjacent to m will fail to send the correct traffic summaries to the AC, thus causing a neighborhood containing m to be detected. We note that the authentication of dAC is needed. Otherwise, a malicious node can replace the sampling and fingerprinting keys with its own fake keys, by which the malicious node can predict the output of other nodes’ sketches and perform packet modification attacks. In addition, if the epoch IDs in dAC were not authenticated, a malicious node could replace the oldest live epoch ID in dAC, for which the traffic summaries are requested, with the current epoch ID. In this way, inconsistencies of traffic summaries could be detected for some benign neighborhood due to the packet transmission delay, as Section 7.4.1 describes. With the (delayed) authentication of dAC, any attempt to modify dAC will be detected (after $\lceil D/L \rceil$ epochs).

It is noteworthy that the dAC sent at the end of epoch e cannot simply disclose the MAC secret key $K_{e-1}$ for the previous epoch $e - 1$. This is because at the time $K_{e-1}$ is disclosed, the dAC sent at the end of epoch $e - 1$ may still not have reached certain nodes. Hence, a malicious node which has already received $K_{e-1}$ might send $K_{e-1}$ to a downstream colluding node via an out-of-band channel, so that the colluding node can break the authenticity of the dAC sent in epoch $e - 1$. Hence, at the end of an epoch e, we disclose the MAC key for epoch $e - \lceil D/L \rceil$ to ensure that the dAC sent in epoch $e - \lceil D/L \rceil$ has reached all the nodes in the network.

Security against corrupting the reporting messages. First, due to the use of Onion Authentication, a malicious node m cannot selectively drop the reporting messages of a remote (non-adjacent) node r to frame a neighborhood containing node r. Since all the accumulated reporting messages are “combined” at each hop, m can only drop the reporting messages from its immediate neighbors, which will cause a neighborhood containing m to be flagged as suspicious.

Security against colluding attacks. We illustrate DynaFL’s security against colluding attacks via a toy example shown in Figure 7.11. We show that for a malicious node m which actually corrupts packets, as long as one benign node exists in N(m), a neighborhood containing either m or one of its colluding nodes will be detected. The key observation is that since the traffic summaries are sent to the AC and the AC performs the detection, each node can only claim one traffic summary per selected epoch. To simplify the analysis while still unveiling the intuition, we only consider the number (but not the payload) of packets sent by each node, as shown in Figure 7.11. Suppose nodes c and d are colluding, and node d drops 50 packets. As long as node e is benign in N(d), to cover the misbehavior of d, the colluding node c has to send a traffic summary to the AC falsely claiming it sent “50” packets to d (and thus received “50” packets from node b). However, this claim will make the neighborhood N(b) suspicious since the benign node a will claim it sent 100 packets to b.

Figure 7.11: Example of DynaFL’s security against colluding nodes. A number denotes the packet count each node sends.
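
The counting argument of this toy example can be replayed in a few lines. The sketch assumes, for illustration only, a single line of nodes a, b, c, d, e (the exact layout of Figure 7.11 may differ), uses a hypothetical helper `neighborhood_suspicious` that applies the flow conservation ratio to packet counts, and takes β = 0.004 from the simulation setting in Section 7.8.

```python
def neighborhood_suspicious(claimed_sent_to_r: int, claimed_recv_from_r: int,
                            beta: float) -> bool:
    """Flow conservation on packet counts only (payloads are ignored here)."""
    largest = max(claimed_sent_to_r, claimed_recv_from_r)
    if largest == 0:
        return False
    return abs(claimed_sent_to_r - claimed_recv_from_r) / largest > beta


beta = 0.004

# N(d): colluding node c claims it sent 50 packets to d; benign e received 50.
n_d = neighborhood_suspicious(claimed_sent_to_r=50, claimed_recv_from_r=50, beta=beta)

# N(b): benign node a claims it sent 100 packets to b; to stay consistent with
# its lie above, c must claim it received only 50 packets from b.
n_b = neighborhood_suspicious(claimed_sent_to_r=100, claimed_recv_from_r=50, beta=beta)

print(n_d, n_b)
```

Under these assumptions the lie hides the drops inside N(d) but moves the inconsistency to N(b), which still contains the colluding node c.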

7.8  Performance  Evaluation 

In  this  section,  we  analyze  the  protocol  overhead  and  study  the  detection  efficiency  of  DynaFL  via 
measurements  and  simulations,  with  our  implementation  of  the  classic  Sketch  [11]  in  C++. 

7.8.1  Storage  Overhead 

DynaFL incurs only per-neighbor state while existing secure path-based fault localization protocols require per-source and per-path state. In this section, we quantify the per-neighbor storage overhead of a DynaFL router r, which primarily includes the packet cache and the sketch for each neighbor s.


Sketch size. We derive numeric values of the sketch size based on Equations 7.8 and 7.9, using an example setting where the average packet size is 300 bytes and the link’s capacity is 10 Gbps (an OC-192 link). Furthermore, we consider δ = 0.001 and β = 2α for the (α, β, δ)-accuracy, i.e., the false positive rate and false negative rate of the sketch-based detection are limited to under 0.001. Figure 7.12 plots the result, from which we can see that a sketch with fewer than 500 bytes can already yield a desirable accuracy.
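
As a rough check of these numbers, the short calculation below evaluates the sketch size for one epoch length. It rests on an assumed reading of Equations 7.8 and 7.9, namely a sketch of $M \log_2 \sqrt{2\eta \ln(200M/\delta)}$ bits with $M > \frac{12}{\epsilon^2} \cdot \frac{1}{3-2\epsilon} \cdot \ln\frac{1}{\delta}$ and $\epsilon = (\beta-\alpha)/(\beta+\alpha)$; the function name and the chosen values of α and β are illustrative.

```python
import math


def sketch_size(epoch_len_s, alpha, beta, delta, pkts_per_s=1e7):
    """Return (number of counters M, sketch size in bytes) under the assumed formulas."""
    eps = (beta - alpha) / (beta + alpha)
    # Assumed Eq. 7.9: smallest integer M strictly above the bound.
    bound = 12 / eps**2 / (3 - 2 * eps) * math.log(1 / delta)
    m = math.floor(bound) + 1
    eta = epoch_len_s * pkts_per_s  # packets in one sampled epoch
    # Assumed Eq. 7.8: bits needed per counter.
    bits_per_counter = math.log2(math.sqrt(2 * eta * math.log(200 * m / delta)))
    return m, m * bits_per_counter / 8


m, size_bytes = sketch_size(epoch_len_s=0.02, alpha=0.002, beta=0.004, delta=0.001)
print(m, round(size_bytes))
```

For a 20ms epoch at the worst-case rate of $10^7$ packets per second, this reading yields a few hundred counters and a sketch a little over 450 bytes, in line with the "fewer than 500 bytes" observation above.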


Figure 7.12: Sketch size for an OC-192 link with the average packet size of 300 bytes and δ = 0.001.

Cache size and per-neighbor storage overhead. We now study the cache size for temporarily storing packet hashes in live epochs, which, together with the sketch size analyzed above, constitutes the per-neighbor storage overhead of a DynaFL router. We denote the upper bound of one-way network latency as D, the epoch length as L, and the number of packets per second as η. Using 20-byte packet hashes, the cache size is given by:

$$\left(\left\lceil \frac{D}{L} \right\rceil + 1\right) \times 20 \cdot \eta \cdot L \tag{7.13}$$

We omit the 1-bit indicator in each packet hash entry that identifies which packet stream the packet belongs to (see Figure 7.6). Assuming the per-neighbor sketch size is 500 bytes, the one-way latency is D = 20ms, and the average packet size is 300 bytes for an OC-192 link, we derive the per-neighbor storage overhead of a DynaFL router for different epoch lengths, shown in Figure 7.13. We can observe that, with an epoch length of 20ms, only around 4MB is required per neighbor. The “humps” in the curve are due to the use of the ceiling function in Equation 7.13.
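
The cache size formula can be evaluated directly. The sketch below assumes Equation 7.13 reads as $(\lceil D/L \rceil + 1) \times 20 \cdot \eta \cdot L$ bytes and derives η from a 10 Gbps link carrying 300-byte packets; the function name and the millisecond-based arguments are choices made for this example.

```python
def cache_size_bytes(latency_ms: int, epoch_ms: int, pkts_per_s: float,
                     hash_bytes: int = 20) -> float:
    """Bytes needed to cache packet hashes for all live epochs (assumed Eq. 7.13)."""
    live_epochs = -(-latency_ms // epoch_ms) + 1  # ceil(D / L) + 1 in integer math
    return live_epochs * hash_bytes * pkts_per_s * (epoch_ms / 1000)


pps = 10e9 / (300 * 8)  # about 4.2 million 300-byte packets per second at 10 Gbps

for epoch_ms in (10, 15, 20, 25):
    mb = cache_size_bytes(20, epoch_ms, pps) / 1e6
    print(f"L = {epoch_ms} ms: {mb:.2f} MB")
```

The result is on the order of a few megabytes per neighbor, and the jump in the number of live epochs whenever L drops below D is what produces the non-monotonic behavior attributed to the ceiling function.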

7.8.2  Key  Management  Overhead 

One distinct advantage of DynaFL is that each router shares only one secret key with the AC, whereas in path-based fault localization protocols each router must, in the worst case, share a secret key with each source node in the network [21], which dramatically complicates key management and broadens the vulnerability surface. To quantify DynaFL’s advantage over path-based fault localization protocols, we leverage the measured ISP topologies from the Rocketfuel dataset [84] and the topology from Internet2 [8]. Figure 7.14 shows the maximum number of keys each router needs to manage in path-based fault localization protocols; a router in DynaFL always requires only one secret key shared with the AC (thus invisible in the figure). We can see that the number of keys a router needs to manage in path-based fault localization protocols is 100 to 10000 times higher than that in DynaFL.

Figure 7.13: Router per-neighbor storage overhead for an OC-192 link with the average packet size of 300 bytes and one-way network latency as 20ms.

Figure 7.14: Key management overhead at each router. A router in DynaFL always requires just one key shared with the AC (hence not visible in the figure).

7.8.3  Bandwidth  Overhead 

We analyze the bandwidth that the traffic summary reports consume on each link, based on the measured ISP topologies from the Rocketfuel dataset [84]. Recall that the reporting messages
are  transmitted  along  a  spanning  tree  rooted  at  the  AC.  Hence,  the  bandwidth  consumption  by 
the  reporting  messages  on  a  link  is  determined  by  the  number  of  children  below  that  link  and  the 
degrees  of  the  children. 

For each ISP topology, we first select a “central” node as the AC, namely the node through which the highest fraction of all shortest paths in the network pass. Then, we create a minimum spanning tree rooted at the central node (the AC) for transmitting reporting messages to the AC. We consider an epoch length of L = 20ms, a per-neighbor traffic summary of 500 bytes, and an epoch sampling rate of 1%. Hence, on average, each node sends only one reporting packet every two seconds. Figure 7.15 plots the results for ISPs with AS numbers 1221, 1239, 1755, 3257, 3967, and 6461. From the results, we can see that the fraction of bandwidth used for reporting traffic summaries on a link is small for all topologies (e.g., between 0.002% and 0.012% for an OC-192 link).
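
The way the reporting load accumulates along the spanning tree can be sketched as follows. The toy tree, the node names, and the function `link_load_bytes_per_s` are hypothetical; the parameters (500 bytes per neighbor, 20ms epochs, 1% sampling) are the ones stated above, and each node's report is assumed to carry one 500-byte summary per neighbor.

```python
def link_load_bytes_per_s(parent, degree, summary_bytes=500,
                          epoch_s=0.02, sampling_rate=0.01):
    """Average report traffic on each tree link, keyed by the child end of the link.

    parent : maps each non-root node to its parent in the spanning tree
    degree : maps each node to its number of neighbors in the network topology
    """
    reports_per_s = sampling_rate / epoch_s  # 0.5, i.e. one report every 2 s
    load = {node: 0.0 for node in parent}
    for node in parent:
        size = degree[node] * summary_bytes * reports_per_s
        hop = node
        while hop in parent:  # charge every link on the way up to the AC
            load[hop] += size
            hop = parent[hop]
    return load


parent = {"a": "AC", "b": "a", "c": "a"}  # toy tree: b and c report through a
degree = {"a": 3, "b": 2, "c": 1}
load = link_load_bytes_per_s(parent, degree)
print(load)
```

In this toy tree the link between a and the AC carries the reports of a, b, and c, which illustrates why the load on a link depends on the number of nodes below it and on their degrees.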

7.8.4  Detection  Delay 

As Section 7.6 states, the AC performs consistency checks and detects any anomalies only when the total number of packets accumulated over multiple epochs exceeds a certain threshold N, in order to give a high enough accuracy (e.g., >99.9%) for the detection results. Hence, the number of packets N characterizes the detection delay of the fault localization protocol. We fully implement the classic Sketch due to Alon et al. [11] in C++ with a four-wise hash function, and perform simulations to study N.

Since neighborhoods in DynaFL are inspected by the AC independently, we perform simulations for independent neighborhoods of different sizes. Since we showed DynaFL’s security against colluding attacks in Section 7.7, we emulate a single malicious node in our simulations. Our setting is as follows. The natural packet loss rate in a neighborhood is 0.001 and the detection thresholds for both flow conservation and content conservation are β = 2α = 0.004. Figure 7.16 depicts the false positive rates in benign cases where no malicious routers exist in the neighborhood. We can see that with N > 5000 packets, the false positive rate is under 1%.

Figure 7.17 shows the false negative rates with a malicious router which only drops packets with a probability of 0.005. Figure 7.18 plots the false negative rates with a malicious router which both drops and modifies packets, each with a probability of 0.005. We can see that the sketch-based approach is effective in detecting packet modification attacks, since by modifying packets the malicious router is detected faster in Figure 7.18 than in Figure 7.17.


7.9  Summary 

After  identifying  the  fundamental  limitations  of  previous  path-based  fault  localization  protocols,  we 
explore  a  neighborhood-based  FL  approach.  We  present  DynaFL,  which  utilizes  delayed  function 
disclosure, a novel technique that enables secure yet efficient checking of packet content conservation.

While existing path-based FL protocols aim to identify a specific faulty link (if any), DynaFL localizes faults to a coarser-grained 1-hop neighborhood, which yields four distinct advantages. First, unlike path-based FL protocols, DynaFL does not require paths or flows to last for any minimum duration in order to detect data-plane faults. Thus, DynaFL can fully cope with the short-lived flows that are common in modern networks. Second, in DynaFL, a source node does not need to know the exact outgoing path, whereas path-based FL protocols require this knowledge. Hence, DynaFL can support agile (e.g., packet-level) load balancing such as VL2 routing [38] for datacenter networks. Third, a DynaFL router only needs around 4MB of per-neighbor state based on our classic Sketch implementation, while a router in a path-based FL protocol requires per-path state. Finally, a DynaFL router only maintains a single secret key shared with the AC, while a router in a path-based FL protocol needs to manage 100 to 10000 secret keys in measured ISP topologies.

We  anticipate  that  our  work  can  spark  future  endeavors  in  designing  practical  network  FL 
protocols  for  modern  networks  such  as  ISP,  enterprise,  and  datacenter  networks. 


Figure 7.15: CDF of per-link bandwidth consumption for the reporting messages in DynaFL. Panels: (a) ISP 1221, (b) ISP 1239, (c) ISP 1755, (d) ISP 3257, (e) ISP 3967, (f) ISP 6461. Each panel plots the CDF against the link bandwidth consumption (KB/s).


Figure 7.16: False positive rates with no malicious activity in a neighborhood with different numbers of nodes. The natural packet loss rate in a neighborhood is 0.001 and the detection thresholds for both flow conservation and content conservation are $T_d$ = β = 2α = 0.004.

Figure 7.17: False negative rates in a malicious neighborhood with five nodes, where the malicious node only drops packets. The natural packet loss rate in a neighborhood is 0.001, the detection thresholds for both flow conservation and content conservation are $T_d$ = β = 2α = 0.004, and the malicious packet dropping rate is 0.005.

Figure 7.18: False negative rates in a malicious neighborhood with five nodes, where the malicious node both drops packets and modifies packets. The natural packet loss rate in a neighborhood is 0.001, the detection thresholds for both flow conservation and content conservation are $T_d$ = β = 2α = 0.004, the malicious packet dropping rate is 0.005, and the malicious packet modification rate is 0.005.


Chapter 8


Related Work


In this chapter, we summarize the related work and discuss its limitations. We first discuss the deficiencies of existing fault detection and fault localization protocols, and then briefly discuss existing work utilizing trusted computing for solving other network security problems.

8.1  Detecting  the  Presence  of  Data-Plane  Attacks 

A  related  line  of  research  aims  to  only  detect  the  presence  of  data-plane  attacks,  without  providing 
the  ability  to  localize  the  attacks  [18,  36,  68]. 

For example, the protocol due to McCune et al. [68] aims to detect the presence of “Denial-of-Message” attacks in sensor networks, where a malicious sensor node drops broadcast messages sent by a base station. The base station solicits authenticated acknowledgments from a randomly selected subset of sensor nodes unpredictable to the attacker. Hence, the failure to receive authenticated acknowledgments from certain selected sensor nodes indicates the presence of packet dropping attacks. However, the protocol does not provide a mechanism to localize malicious sensor nodes that drop the messages.

Both Stealth probing [18] and PQM [36] employ secure probing techniques to detect data-plane attacks and monitor the forwarding quality of an entire end-to-end path. These end-to-end path monitoring schemes are commonly used in conjunction with secure multi-path routing, which aims to mitigate data-plane attacks. For example, both ACR [90] and Sprout [33] rely on end-to-end monitoring to evaluate path quality and to avoid low-quality paths by switching to fail-over paths. However, being too lenient and oblivious to the faulty elements, such monitor-and-switch methods suffer from long path exploration delays: without localizing the faulty elements, further explored paths may still contain the same faulty elements, resulting in exponential path exploration complexity in the worst case.

Realizing the importance of localizing data-plane attacks, researchers have recently proposed several approaches for network fault localization. As we show below, the known secure fault localization protocols are all path-based and suffer from either security vulnerabilities or high protocol overhead.


8.2  Vulnerabilities  of  Existing  Fault  Localization  Schemes 

Perlman first described the idea of acknowledgment-based approaches to detect data-plane adversaries and achieve robust routing in the presence of Byzantine failures [76]. However, details of how
to  achieve  secure  fault  localization  are  not  presented.  We  summarize  the  major  security  pitfalls  of 
recent  fault  localization  protocols  as  follows. 


Evading and framing attacks. In ODSBR [19] and Secure Traceroute [74], the source node monitors the end-to-end loss rate of the path; only when the observed loss rate exceeds a certain threshold does the source start probing specific nodes in the path, soliciting acknowledgments for the subsequent packets the source sends. However, a malicious node can safely drop packets when the probing is not activated, while behaving “normally” when probing is invoked. Hence, the source can neither catch the malicious nodes nor bound the malicious dropping rate, unless the probing is always activated, which incurs high overhead. In addition, ODSBR employs binary search in the probing phase to localize packet dropping, until the algorithm converges to a specific link. Since the binary search algorithm proceeds on each lost packet (possibly due to natural loss), in the presence of natural packet loss the algorithm either does not converge or incurs high false positives by framing benign links.

Packet modification attacks. A considerable number of fault localization protocols require each router to maintain certain traffic summaries for the received and sent packets. By periodically comparing the local traffic summaries with those of other routers, such fault localization protocols can identify faulty links based on flow conservation. However, due to the challenges of efficiently authenticating packets and the traffic summaries, many such fault localization protocols fail to authenticate packets using the traffic summaries, and are thus vulnerable to sophisticated packet modification attacks. For example, WATCHERS [43], Audit [14] and Fatih [71] implement the traffic summaries using either counters or Bloom Filters [24] with no secret keys, thus remaining vulnerable to packet modification attacks. The recently proposed Network Confessional [15] also fails to prevent packet modification attacks due to the lack of efficient packet authentication.

Collusion attacks. Liu et al. propose enabling two-hop-away routers in the path to monitor each other [58] by using 2-hop acknowledgment packets. However, such a 2-hop-based detection scheme is vulnerable to colluding neighboring routers. Similarly, both Watchdog [66] and Catch [65] can identify and isolate malicious routers for wireless ad hoc networks, where a sender S verifies whether its next-hop node indeed forwards S’s packets by promiscuously listening to that node’s transmission. Both Watchdog and Catch are vulnerable to collusion attacks, where a malicious node $f_m$ drops the packets of a remote sender S (which is out of the promiscuous listening range of $f_m$) while the colluding neighbors in the promiscuous listening range of $f_m$ intentionally do not report the packet dropping behavior of $f_m$.


8.3  Applicability  and  Practicality 

Among  the  known  secure  proposals,  the  protocol  due  to  Avramopoulos  et  al.  [17]  incurs  high 
computational  and  communication  overhead,  because  it  requires  acknowledgments  from  all  routers 
in  the  path,  and  requires  multiple  digital  signature  generation  and  verification  operations  for  each 
data  packet. 

Recently, Barak et al. proposed a set of fault localization protocols for the Internet [21]. However, their statistical FL protocol is mainly optimized to reduce communication overhead, and consequently achieves a rather poor best-case detection rate on the order of $10^6$ packets.

A  recent  proposal  due  to  Wang  et  al.  [89]  for  forwarding  fault  localization  in  sensor  networks 
requires  a  special  tree-like  routing  infrastructure  where  the  communications  take  place  only  between 
a  sensor  node  and  the  same  trusted  base  station. 

8.4  Trusted  Computing  for  Network  Security 

Many  efforts  in  trusted  computing  focus  on  efficient  implementation  of  remote  attestation,  sealed 
storage,  and  secure  boot  for  bootstrapping  trust  on  commodity  computers  [75,  69].  A  few  proposals 
also  consider  utilizing  trusted  computing  to  address  network  security  plagues  [82,  40,  80].  However, 
BIND  [82]  focuses  on  routing  security  and  cannot  secure  against  raw  user  input  and  configurations. 
Not-a-Bot  [40]  leverages  trusted  computing  and  TPM  to  mitigate  DDoS  attacks  but  not  to  secure 
the  network  layer.  Recently,  Saroiu  et  al.  [80]  propose  the  design  of  TPM-based  “trusted  sensors” 
via  remote  attestation  to  secure  a  broad  range  of  mobile  applications. 


Chapter  9 


Conclusion 


The  rising  demand  for  high-quality  online  services  requires  reliable  packet  delivery  at  the  network 
layer.  Data-plane  fault  localization  is  recognized  as  a  promising  means  to  this  end,  since  it  enables 
a  source  node  to  localize  faulty  links,  find  a  fault-free  path,  and  enforce  contractual  obligations 
among  network  providers.  This  dissertation  designs,  analyzes,  implements,  and  evaluates  secure 
and  practical  fault  localization  protocols.  Instead  of  aiming  to  detect  any  single  forwarding  failure, 
we  demonstrate  that  fault  localization  protocols  can  effectively  limit  the  negative  influence  an 
adversary  can  inflict  at  the  data  plane,  with  a  provable  lower  bound  on  the  forwarding  correctness. 
Based  on  the  philosophy  of  limiting  the  adversarial  activities,  we  develop  a  suite  of  probabilistic 
algorithms  and  leverage  emerging  hardware  virtualization  technologies,  by  which  we  dramatically 
reduce  the  protocol  overhead  without  sacrificing  security. 

While  we  compare  the  efficiency  of  the  proposed  protocols  early  in  Chapter  1  (Table  1.1), 
Table  9.1  further  compares  the  effectiveness  of  fault  localization  between  the  proposed  protocols. 
Clearly,  PAAI  incurs  longer  detection  delay  than  the  other  protocols  due  to  its  use  of  packet 
sampling, where the fate of unsampled packets (e.g., more than 90% of packets) cannot contribute to the
monitoring  process.  Hence,  PAAI  requires  more  packet  transmissions  to  achieve  accurate  fault 
localization.  However,  PAAI  does  not  require  any  changes  to  the  existing  packet  headers;  thus  it 
is  most  applicable  to  networks  where  the  packet  header  cannot  be  changed. 

ShortMAC is a more efficient path-based protocol than PAAI because it acknowledges a set
of packets using counters in a single ACK. In addition, ShortMAC does not require the loose time
synchronization that PAAI and DynaFL do. By using efficient k-bit MAC authentication,
ShortMAC dramatically reduces the communication overhead and enables the use of state-efficient
counters. As a path-based protocol, ShortMAC localizes data-plane faults to a specific link without
special hardware support (which TrueNet requires), a precision that DynaFL cannot achieve.


Protocol    Detection Delay    Forwarding Correctness    Precision                      Global Sharing?
PAAI        3.5 × 10^4 pkts    95%                       link                           no
ShortMAC    2 × 10^3 pkts      95%                       link                           no
TrueNet     2 × 10^3 pkts      95%                       link (software attack only)    yes
DynaFL      5 × 10^4 pkts      95%                       1-hop neighborhood             yes

Table 9.1: Comparison of the fault localization effectiveness between the proposed protocols. The
numeric values for the detection delay and guaranteed forwarding correctness are derived from
simulations with the path length $d = 5$, allowed upper bound on false positive and negative rates
$\delta = 0.01$, natural loss rate $\rho = 0.005$, and per-link detection threshold $T_{dr} = 0.01$.


However, 1-hop-based fault localization protocols have several fundamental advantages over
path-based protocols. First, in both TrueNet and DynaFL, routers only maintain per-neighbor
state, whereas the path-based protocols PAAI and ShortMAC require routers to store per-path
state. In addition, 1-hop-based fault localization protocols can support dynamic routing paths
and traffic patterns. Finally, TrueNet enables secure global sharing of detection results through its
use of trusted computing, and DynaFL achieves this property through its trusted
centralized controller. However, these benefits of 1-hop-based protocols come at a cost. Specifically,
TrueNet requires trusted computing, and thus potentially relies on special hardware support
(such as TPM chips) and is vulnerable to hardware-based data-plane attacks. DynaFL does
not rely on trusted computing, but localizes faults to a specific 1-hop neighborhood instead of a
specific link.

Finally, we discuss the applicability of these protocols to different types of real networks, namely,
1) wireless sensor networks or mesh networks (or wireless multi-hop networks in general), 2) ISP
networks, 3) enterprise and datacenter networks, and 4) the Internet.

Wireless multi-hop networks. First, since wireless multi-hop networks tend to have lower
bandwidth than wired networks (and packet transmission is particularly costly in sensor
networks), DynaFL may be inapplicable due to its relatively high communication overhead for
reporting the traffic summaries to the AC. In addition, since nodes in a wireless multi-hop network
may be deployed at publicly accessible locations (as in sensor networks), these nodes may be
subject to physical compromise, in which case the (physical) security of the trusted computing
primitives required by TrueNet may no longer hold. Both ShortMAC and PAAI-1 can deal with
physical compromise of nodes and are bandwidth-efficient. ShortMAC requires forwarding nodes
to maintain per-path monitoring state, while the router state of PAAI-1 is bounded by the link
bandwidth. Hence, to decide whether ShortMAC or PAAI-1 is more state-efficient, a
network administrator needs to take the network size, path distribution, and link capacity as inputs
to calculate and compare the router storage cost for each protocol.

ISP  networks.  ISP  networks  are  generally  well  managed  and  routers  are  physically  protected. 
Hence,  such  a  network  can  satisfy  the  security  requirements  of  trusted  computing  (in  other  words, 
most  attacks  are  through  remote  software  exploits)  and  enjoy  the  high  efficiency  and  small  router 
state  of  TrueNet.  In  contrast,  both  PAAI  and  ShortMAC  require  per-sender  key  storage  and  path 
monitoring  state,  which  may  not  scale  in  a  large  ISP  network.  Though  DynaFL  provides  small 
router  state  and  constant  key  storage,  it  only  localizes  data-plane  faults  to  a  1-hop  neighborhood, 
which requires further investigation within the suspicious neighborhood (thus incurring additional
overhead).

Enterprise and datacenter networks. As already mentioned in the discussion of DynaFL, modern enterprise
and datacenter networks may employ fine-grained load balancing, which results in both dynamic
flows and paths. DynaFL is the only protocol among the proposed ones that does not require path
knowledge or path stability, and hence is the most applicable to enterprise and datacenter networks.

The Internet. When performing fault localization at the scale of the Internet, we treat each
Autonomous System (AS) as a single “node”. Establishing “trust of code” via trusted computing
across administrative domains would be troublesome; in addition, each AS physically controls its
own routers and thus can compromise the “trusted” hardware to subvert the trusted computing
primitives. Hence, TrueNet would be inapplicable to the Internet setting. In addition, DynaFL
requires a centralized controller to manage all the nodes, which may be impractical as well. PAAI
requires time synchronization among different ASes, which may incur additional complexity and
overhead at the scale of the Internet. Finally, ShortMAC can be deployed in the Internet, where the
shared secret keys between ASes can be established via existing protocols such as Passport [60].

We anticipate that this dissertation demonstrates the possibility of achieving guaranteed forwarding
correctness via secure and efficient fault localization. We hope the proposed fault localization
protocols and probabilistic algorithms can serve as building blocks for constructing other
secure network protocols such as secure routing and Denial-of-Service (DoS) defenses. In addition,
we intend for TrueNet to be used as a case study to spark future research on leveraging trusted
computing to solve other network security problems. As future work, we plan to derive theoretical
performance bounds for fault localization protocols under distributions of the natural packet loss rate
other than the uniform distribution. We also plan to further investigate the incremental deployment
and real-world deployment issues of fault localization protocols.


Appendix A

Proofs  for  PAAI 


A.1  Proof of Theorem 7

In the following, we analyze the full-ack and PAAI-1 protocols together, and PAAI-2 separately.
We prove the general case where the adversary controls $z$ malicious links in a forwarding path;
an upper bound of $\alpha$ in Definition 5 can then be derived by setting $z = 1$.

Full-ack and PAAI-1. Since the onion report used in the full-ack and PAAI-1 schemes can be
used to locate a specific link for each lost packet, under the converged condition each malicious link
can drop at most a $T_{dr}$ fraction of packets without being detected. This in turn implies our results
in Theorem 7 for the full-ack and PAAI-1 schemes.

PAAI-2. First note that the score difference $\Delta_j = |s_{j+1} - s_j|$ is given by

(A.1)  $\Delta_j = \eta_{j+1} \cdot \left\{ 1 - \left[ \prod_{y=0}^{j} (1 - \rho^*_y) \right]^3 \right\}$,

where $\eta_{j+1}$ is the number of times that $f_{j+1}$ is selected, and $\rho^*_i$ is the average drop rate of link
$l_i$. Based on the values of $\Delta_k$ and $\eta_{k+1}$ known to $S$, $S$ can compute the average drop rate $\rho^*_k$ by (let
$c = \prod_{y=0}^{k-1} (1 - \rho^*_y)^3$):

(A.2)  $\rho^*_k = \begin{cases} 1 - \left( 1 - \frac{\Delta_0}{\eta_1} \right)^{1/3}, & k = 0 \\ 1 - \left[ \frac{1}{c} \left( 1 - \frac{\Delta_k}{\eta_{k+1}} \right) \right]^{1/3}, & k \geq 1. \end{cases}$

Now we establish an end-to-end drop rate threshold $\psi_{th}$. We must ensure that the actual
end-to-end drop rate exceeds $\psi_{th}$ only when at least one malicious link drops more than a $T_{dr}$ fraction
of packets (where $T_{dr}$ is the per-link drop rate threshold). Thus in the worst case (with the highest
natural packet loss, i.e., using $T_{dr}$ as the per-link drop rate), we can compute

(A.3)  $\psi_{th} = 1 - (1 - T_{dr})^{2d}$.

If each link $l_i$ has a drop rate $\rho^*_i \leq T_{dr}$, the end-to-end drop rate $\psi_d$ is given by:

(A.4)  $\psi_d = 1 - \left[ \prod_{i=0}^{d-1} (1 - \rho^*_i) \right]^2 \leq 1 - (1 - T_{dr})^{2d} = \psi_{th}$.

Thus when $\psi_d > \psi_{th}$, there must be at least one malicious link. Then $S$ derives the individual drop
rate $\rho^*_i$ of each link $l_i$ by using Equation A.2. By comparing each $\rho^*_i$ with $T_{dr}$, $S$ can identify the
malicious links. Note that in the converged condition, there is no false negative, while the false
positive rate is given by Theorem 10.
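
As a rough illustration of how the source could invert the score differences into per-link drop rates, the following Python sketch assumes a model in which a trial for link $l_k$ succeeds only if the data packet, the probe, and the ack all cross links $l_0, \ldots, l_k$ (hence the cube). The function names and the sample path are hypothetical and are not part of the PAAI specification; the forward model is only used here to generate consistent test inputs.

```python
def expected_score_differences(rates, selections):
    """Assumed forward model: Delta_k = eta_{k+1} * (1 - prod_{y<=k} (1 - rho_y)^3)."""
    deltas, survive = [], 1.0
    for rho, eta in zip(rates, selections):
        survive *= (1.0 - rho) ** 3      # data, probe and ack must all cross link k
        deltas.append(eta * (1.0 - survive))
    return deltas


def recover_drop_rates(deltas, selections):
    """Invert the model link by link, starting from the link closest to the source."""
    rates, c = [], 1.0                   # c = prod_{y<k} (1 - rho_y)^3
    for delta, eta in zip(deltas, selections):
        rho = 1.0 - ((1.0 - delta / eta) / c) ** (1.0 / 3.0)
        rates.append(rho)
        c *= (1.0 - rho) ** 3
    return rates


def links_above_threshold(rates, t_dr):
    """Indices of links whose estimated drop rate exceeds the per-link threshold."""
    return [k for k, rho in enumerate(rates) if rho > t_dr]


# Hypothetical path of d = 5 links; link 2 drops 3% of packets, the rest 0.5%.
true_rates = [0.005, 0.005, 0.03, 0.005, 0.005]
selections = [2000] * 5                  # eta_{k+1}: how often each node was selected
deltas = expected_score_differences(true_rates, selections)
estimates = recover_drop_rates(deltas, selections)
flagged = links_above_threshold(estimates, t_dr=0.01)
print(flagged)
```

With exact (noise-free) score differences the inversion returns the original rates, so only the link whose rate exceeds $T_{dr}$ is flagged; with sampled scores the estimates would carry the estimation error analyzed in Theorem 10.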

Now we compute the maximum end-to-end drop rate that an adversary can cause without being
detected, i.e., without causing $\psi_d > \psi_{th}$. Suppose there are $z$ malicious links with drop rates
$\rho^*_{M_1}, \ldots, \rho^*_{M_z} > T_{dr}$. Given the fixed threshold $\psi_{th}$, when the malicious links cause the maximum
drop rate with $\psi_d \leq \psi_{th}$, we have:

(A.5)  $1 - (1 - \rho)^{2(d-z)} \cdot \left[ \prod_{k=1}^{z} (1 - \rho^*_{M_k}) \right]^2 = 1 - (1 - T_{dr})^{2d}$.

Therefore, $z$ malicious links can drop at most

(A.6)  $1 - \left[ \prod_{k=1}^{z} (1 - \rho^*_{M_k}) \right]^2 = 1 - \frac{(1 - T_{dr})^{2d}}{(1 - \rho)^{2(d-z)}}$

fraction of traffic without being detected.

A.2  Proof of Corollary 8

Suppose that a malicious link $l_{M_k}$ ($k = 1, 2, \ldots, z$) drops data, probe and ack packets at different
rates, denoted by $\tau_{M_k}$, $\nu_{M_k}$ and $\omega_{M_k}$ respectively. Given the fixed threshold $\psi_{th}$, the malicious
links can cause the maximum drop rate when $\psi_d \leq \psi_{th}$, yielding:

(A.7)  $1 - (1 - \rho)^{2(d-z)} \prod_{i=1}^{z} (1 - \tau_{M_i})(1 - \nu_{M_i})(1 - \omega_{M_i}) \leq \psi_{th}$.

Therefore, the adversary can drop at most

(A.8)  $1 - \prod_{i=1}^{z} (1 - \tau_{M_i})(1 - \nu_{M_i})(1 - \omega_{M_i}) = 1 - \frac{(1 - T_{dr})^{2d}}{(1 - \rho)^{2(d-z)}}$

fraction of packets without being detected. This yields the same result as Theorem 7.
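
To make the PAAI-2 bound concrete, the following Python sketch evaluates the end-to-end threshold $\psi_{th} = 1 - (1 - T_{dr})^{2d}$ and the maximum undetected drop fraction $1 - (1 - T_{dr})^{2d} / (1 - \rho)^{2(d-z)}$ for the parameters used in Table 9.1 ($d = 5$, $\rho = 0.005$, $T_{dr} = 0.01$). The function names are hypothetical, and the numbers are plain evaluations of the closed form, not simulation results.

```python
def end_to_end_threshold(t_dr, d):
    """psi_th = 1 - (1 - T_dr)^(2d)."""
    return 1.0 - (1.0 - t_dr) ** (2 * d)


def max_undetected_drop(t_dr, rho, d, z):
    """1 - (1 - T_dr)^(2d) / (1 - rho)^(2(d - z)) for z malicious links on the path."""
    return 1.0 - (1.0 - t_dr) ** (2 * d) / (1.0 - rho) ** (2 * (d - z))


d, rho, t_dr = 5, 0.005, 0.01
bounds = [max_undetected_drop(t_dr, rho, d, z) for z in range(1, d + 1)]
for z, bound in enumerate(bounds, start=1):
    print(z, round(bound, 4))
```

The bound grows with $z$ and coincides with $\psi_{th}$ when every link on the path is malicious ($z = d$), since no benign link is then left to contribute natural loss.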


A.3  Proof of Corollary 9

The results are straightforward for full-ack and PAAI-1. For PAAI-2, leveraging $(1 - x)^n \approx 1 - nx$
when $x \to 0$ and neglecting the second-order $\rho^2$ term, we can transform the formula of $\zeta$ in
Theorem 7 as:

(A.9)  $\zeta = 2d\epsilon + \rho \cdot (4d^2\epsilon + z(2 - 4d\epsilon))$.

This proves that $\zeta$ increases proportionally to $\rho$. Next, we show that the optimal strategy of the
adversary (to drop the maximum traffic with $z$ compromised links) is to deploy only one malicious
link per path. We give a sketch of the proof by showing two extreme cases to illustrate the
intuition.

In one extreme case, where the adversary deploys all $z$ compromised links on one path, the
adversary can drop at most

(A.10)  $\zeta_1 = 2d\epsilon + \rho \cdot (4d^2\epsilon + z(2 - 4d\epsilon))$

fraction of packets, as calculated above. In the other extreme case, where the adversary deploys
one compromised link per path (thus $z$ paths contain a compromised link), the adversary can drop
at most

(A.11)  $\zeta_2 = z \cdot (2d\epsilon + \rho \cdot (4d^2\epsilon + 1 \cdot (2 - 4d\epsilon)))$

fraction of packets. Obviously we have

(A.12)  $\zeta_1 < \zeta_2 < z \cdot \zeta_1$.
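
The two extreme cases can be checked numerically. The Python sketch below evaluates the linearized expression $2d\epsilon + \rho \cdot (4d^2\epsilon + z(2 - 4d\epsilon))$ for both placements. It assumes $\epsilon = T_{dr} - \rho$, which is the reading under which this expression is the first-order expansion of the Theorem 7 bound; the function name and the choice $z = 3$ are hypothetical.

```python
def zeta_linear(d, eps, rho, z):
    """Linearized undetected drop fraction for z malicious links on one path."""
    return 2 * d * eps + rho * (4 * d * d * eps + z * (2 - 4 * d * eps))


d, rho, t_dr, z = 5, 0.005, 0.01, 3
eps = t_dr - rho                              # assumed meaning of epsilon

zeta_1 = zeta_linear(d, eps, rho, z)          # all z links on a single path
zeta_2 = z * zeta_linear(d, eps, rho, 1)      # one link on each of z paths

print(round(zeta_1, 4), round(zeta_2, 4), round(z * zeta_1, 4))
```

For these parameters the spread-out placement drops more in total than the concentrated one, but less than $z$ times as much, which is the ordering stated in (A.12).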

A.4  Proof of Theorem 10

In the following proof, we first study how many packet transmissions are required to estimate the
drop rate of a single link $l_i$ within a certain accuracy interval. Suppose that the true value of the drop
rate of $l_i$ is $\rho_i$, and the estimated drop rate of $l_i$ is $\rho^*_i$. We compute the number of packets needed
to achieve an $(\epsilon_{\rho^*}, \delta)$-accuracy for $\rho^*_i$:

(A.13)  $\Pr(|\rho^*_i - \rho_i| > \epsilon_{\rho^*}) < \delta$,

i.e., with probability $1 - \delta$ the estimated $\rho^*_i$ is within

(A.14)  $(\rho_i - \epsilon_{\rho^*}, \rho_i + \epsilon_{\rho^*})$.

Then we compute the total number of packets needed to achieve an $(\epsilon_{\rho^*}, \delta)$-accuracy for every link’s
$\rho^*_i$. In the following we analyze the full-ack scheme and the PAAI protocols in turn.

Full-ack. We first study a given link $l_i$. Let $\rho^*_i$ be the estimated drop rate of link $l_i$, and $p_i$
be the observed conditional probability that link $l_i$ correctly forwards both a data packet and the
returning ack given that a data packet reaches the upstream end of $l_i$, i.e., node $f_i$ in Figure 2.1.
Thus we have:

(A.15)  $p_i = (1 - \rho^*_i)^2$.

We define each time a data packet reaches node $f_i$ as a random trial of $l_i$ (or trial for $l_i$ in
short). Then using Maximum Likelihood Estimation of $p^*_i$ and Hoeffding’s inequality, we have:

(A.16)  $\Pr(|p_i - p^*_i| > \epsilon_{p_i}) < 2e^{-2N_i \epsilon_{p_i}^2}  \Rightarrow  N_i = \frac{\ln(\frac{2}{\delta})}{2\epsilon_{p_i}^2}$.

Note that we are not interested in $\epsilon_{p_i}$, but $\epsilon_{\rho^*}$ instead. However, $\rho^*_i$ cannot be directly estimated,
but must be derived from Equation A.15, i.e.:

(A.17)  $\rho^*_i = 1 - p_i^{1/2}$.

Given the error $\epsilon_{p_i}$ of $p_i$, we can further derive the error $\epsilon_{\rho^*}$ of $\rho^*_i$ using the Uncertainty
Propagation Rule:

(A.18)  $\epsilon_{\rho^*} = \left| \frac{d\rho^*_i}{dp_i} \right| \epsilon_{p_i} = \frac{1}{2} p_i^{-1/2} \epsilon_{p_i}$.

Combining Equations A.15, A.16 and A.18, and given $\epsilon_{\rho^*} \leq \epsilon$, we have:

(A.19)  $N_i = \frac{\ln(\frac{2}{\delta})}{2 (2 \epsilon_{\rho^*} p_i^{1/2})^2} = \frac{\ln(\frac{2}{\delta})}{8 \epsilon_{\rho^*}^2 \cdot (1 - \rho)^2} \geq \frac{\ln(\frac{2}{\delta})}{8 \epsilon^2 \cdot (1 - \rho)^2}$.

Now we compute the number of packets needed to give an estimate with $(\epsilon, \delta)$-accuracy for
every link in a given path. When each packet transmitted by the source can reach node $f_{d-1}$, it
provides a trial for every link $l_i$. Therefore, transmitting $N_{d-1}$ packets to $f_{d-1}$ also suffices to give
other links enough trials, which requires

(A.20)  $N_{d-1} \cdot \frac{1}{(1 - T_{dr})^d} = \frac{\ln(\frac{2}{\delta})}{8 \epsilon^2 \cdot (1 - \rho)^{2+d}}$

total packets transmitted from the source.
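
For a feel of the magnitudes involved, the following Python sketch evaluates the full-ack expressions $\ln(2/\delta) / (8\epsilon^2 (1 - \rho)^2)$ (trials per link) and $\ln(2/\delta) / (8\epsilon^2 (1 - \rho)^{2+d})$ (packets sent by the source). The function names are hypothetical, and the accuracy $\epsilon = 0.005$ is an illustrative choice rather than a value taken from the evaluation.

```python
import math


def trials_per_link(eps, delta, rho):
    """Trials needed for one link: ln(2/delta) / (8 eps^2 (1 - rho)^2)."""
    return math.log(2.0 / delta) / (8.0 * eps ** 2 * (1.0 - rho) ** 2)


def packets_from_source(eps, delta, rho, d):
    """Packets the source must send: ln(2/delta) / (8 eps^2 (1 - rho)^(2 + d))."""
    return math.log(2.0 / delta) / (8.0 * eps ** 2 * (1.0 - rho) ** (2 + d))


eps, delta, rho, d = 0.005, 0.01, 0.005, 5
n_link = trials_per_link(eps, delta, rho)
n_source = packets_from_source(eps, delta, rho, d)
print(math.ceil(n_link), math.ceil(n_source))
```

The two quantities differ only by the factor $(1 - \rho)^{-d}$, which accounts for packets lost before they reach the last node of the path.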


PAAI protocols. The result for PAAI-1 is straightforward; therefore we now focus on PAAI-2.
Let $\rho^*_k$ be the estimated drop rate of link $l_k$. Let the event that node $f_{k+1}$ is selected be a random
trial for $l_k$ (or trial for $l_k$ for short). Let $p_k$ be the observed conditional probability that node $f_{k+1}$
fails to ack when $f_{k+1}$ is selected, and $p^*_k$ be the true value of $p_k$. Similar to the proof of the full-ack
scheme above, using Hoeffding’s inequality we have:

(A.21)  $\Pr(|p_k - p^*_k| > \epsilon_{p_k}) < 2e^{-2N_k \epsilon_{p_k}^2}  \Rightarrow  N_k = \frac{\ln(\frac{2}{\delta})}{2\epsilon_{p_k}^2}$.

Again, since we are not interested in $\epsilon_{p_k}$, but instead $\epsilon_{\rho^*}$, we derive $\epsilon_{\rho^*}$ from $\epsilon_{p_k}$ in the following.
Since $p_k = \frac{\Delta_k}{\eta_{k+1}}$, leveraging Equation A.2 and simplifying it using $(1 - x)^n \approx 1 - nx$ when $x \to 0$
and neglecting high-order ($\geq 2$) terms of $x$, we have:

(A.22)  $\rho^*_k \approx \frac{p_k}{3} - \sum_{y=0}^{k-1} \rho^*_y$.

Using the Uncertainty Propagation Rule further yields:

(A.23)  $\epsilon_{\rho^*_k}^2 = \sum_{y=0}^{k-1} \left( \frac{\partial \rho^*_k}{\partial \rho^*_y} \cdot \epsilon_{\rho^*_y} \right)^2 + \left( \frac{\partial \rho^*_k}{\partial p_k} \cdot \epsilon_{p_k} \right)^2 = \sum_{y=0}^{k-1} \epsilon_{\rho^*_y}^2 + \frac{1}{9} \epsilon_{p_k}^2$.

Solving the above recursion and leveraging Equation A.21, we further have:

(A.24)  $N = 2^d \cdot \frac{\ln(\frac{2}{\delta})}{18 \epsilon^2}$.

Now we compute the total number of packets needed to give an $(\epsilon, \delta)$-accuracy estimate for every
link’s drop rate in a given path. If we abstract a random trial for link $l_k$ as coupon $k$, then a path
with length $d$ has $d$ different coupons. The problem is to compute the expected wait time (number
of trials) to gather $N$ copies of each coupon $k$. When $N = 1$ the problem reduces to the classic
Coupon Collector problem, which has an expected wait time $O(d \cdot \log(d))$. With $N > 1$, the wait
time has a simple upper bound:

(A.25)  $O(N \cdot d \cdot \log(d)) = O\left( 2^d \cdot \frac{\ln(\frac{2}{\delta})}{18 \epsilon^2} \cdot d \cdot \log(d) \right)$.

This proves the theorem.
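
The coupon-collector abstraction can be explored with a small Monte Carlo experiment. The Python sketch below assumes that each trial selects one of the $d$ links uniformly at random (an assumption made for illustration only) and measures how many trials are needed until every link has been selected at least $N$ times. The function names, the number of runs, and the seed are hypothetical.

```python
import random


def trials_until_n_copies(d, n, rng):
    """Draw uniformly among d coupons until each one has been seen n times."""
    counts = [0] * d
    remaining = d                      # coupons that still have fewer than n copies
    trials = 0
    while remaining:
        k = rng.randrange(d)
        counts[k] += 1
        if counts[k] == n:
            remaining -= 1
        trials += 1
    return trials


def mean_trials(d, n, runs=200, seed=1):
    """Average wait time over independent runs with a fixed seed."""
    rng = random.Random(seed)
    return sum(trials_until_n_copies(d, n, rng) for _ in range(runs)) / runs


d = 5
harmonic = sum(1.0 / i for i in range(1, d + 1))   # classic case: d * H_d expected trials
single = mean_trials(d, 1)
multi = mean_trials(d, 20)
print(single, d * harmonic)
print(multi, 20 * d * harmonic)
```

Collecting $N$ complete sets one after another takes $N \cdot d \cdot H_d$ trials in expectation, which is the simple $O(N \cdot d \cdot \log(d))$ upper bound; the simulated wait time for $N$ copies of each coupon stays below it because trials that would be wasted in one round count towards later rounds.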


A.5  Proof of Corollary 11

Since the proof of this corollary is straightforward, we only give a proof sketch here. Given that the
natural loss rate $\rho \ll 1$ in practice, we can approximate $(1 - \rho)^{2+d} \approx 1$ for calculating $N_1$ and $N_2$
in Theorem 10. We can then study the influence of each parameter by taking the partial derivative
of $N_1$ and $N_2$ with respect to that parameter.


Appendix  B 


Proofs  for  ShortMAC 

B.1  Proof of Lemma 13

Recall from Section 5.3 that in ShortMAC, the source finds the first $C^{bad}_i$ such that $C^{bad}_i \geq T_{in}$, and
identifies link $l_i$ as malicious. In this proof, we first derive the upper bound $\beta$ of malicious packet
injection (which is based on $T_{in}$) according to the upper bound $\delta$ of the false negative rate. Then we
calculate the injection threshold $T_{in}$ given the false positive upper bound $\delta$.

With $k$-bit MACs, when $f_{i-1}$ receives a fake packet, the probability that $C^{bad}_{i-1}$ will be increased
is $q = \frac{2^k - 1}{2^k}$, since the adversary can only randomly generate a $k$-bit string for the fake packet
without knowledge of the secret keys of other (benign) routers. The probability that $C^{bad}_i$ will be
increased is $q(1 - q)$.


Malicious Injection Bound. WLOG, suppose $f_m$ is a malicious router and $f_{m+1}$ is benign
(there can be other malicious routers between the source and $f_m$). Suppose the malicious routers
between the source and $f_m$ (including $f_m$) inject $y$ packets on link $l_{m+1}$. Then whether $l_{m+1}$ will
be detected depends on the value of $C^{bad}_{m+1}$, and the false negative rate $P_{fn}$ is given by:

(B.1)  $P_{fn} = P(C^{bad}_{m+1} < T_{in}) = P((q - \epsilon) y < T_{in}) \leq 2e^{-2y(q - \frac{T_{in}}{y})^2}$  (Hoeffding’s inequality),

where $\epsilon$ is the deviation and $0 < \epsilon < q$. To achieve the desired upper bound $P_{fn} \leq \delta$, we set the
threshold $\beta$ such that

(B.2)  $2e^{-2\beta(q - \frac{T_{in}}{\beta})^2} = \delta$.

Solving for $\beta$ gives:

(B.3)  $\beta = \frac{T_{in}}{q} + \frac{\sqrt{(\ln\frac{2}{\delta})^2 + 8 q T_{in} \ln\frac{2}{\delta}} + \ln\frac{2}{\delta}}{4q^2}$.

(B.3) implies that if the adversary injects more than $\beta$ packets on a single link $l_{m+1}$, $C^{bad}_{m+1}$ will
exceed $T_{in}$ and $l_{m+1}$ will be detected with a high probability $\geq 1 - \delta$ (or a false negative rate lower
than $\delta$).
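
The closed form for $\beta$ can be checked against the condition it was solved from. The Python sketch below computes $\beta = T_{in}/q + (\sqrt{(\ln\frac{2}{\delta})^2 + 8 q T_{in} \ln\frac{2}{\delta}} + \ln\frac{2}{\delta}) / (4q^2)$ and substitutes it back into $2e^{-2\beta(q - T_{in}/\beta)^2}$. The function names and the sample values of $T_{in}$, $q$ and $\delta$ are hypothetical.

```python
import math


def injection_bound(t_in, q, delta):
    """Closed-form beta solving 2 * exp(-2 * beta * (q - t_in / beta)^2) = delta."""
    log_term = math.log(2.0 / delta)
    root = math.sqrt(log_term ** 2 + 8.0 * q * t_in * log_term)
    return t_in / q + (root + log_term) / (4.0 * q * q)


def false_negative_bound(y, t_in, q):
    """Hoeffding bound on the false negative rate when y packets are injected."""
    return 2.0 * math.exp(-2.0 * y * (q - t_in / y) ** 2)


t_in, q, delta = 100.0, 0.75, 0.01     # illustrative values only
beta = injection_bound(t_in, q, delta)
print(beta, false_negative_bound(beta, t_in, q))
```

Substituting $\beta$ back reproduces $\delta$ up to floating-point error, and injecting more than $\beta$ packets only lowers the bound further.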


Injection Detection Threshold. WLOG, suppose $f_m$ is a malicious router and $f_{m+1}$ is benign
(there can be other malicious routers between the source and $f_m$). Suppose the malicious routers
between the source and $f_m$ (including $f_m$) inject $y$ packets on link $l_{m+1}$. False positives occur when
$C^{bad}_{m+1} < T_{in}$ but $C^{bad}_i \geq T_{in}$ (where $i \geq m + 2$). (WLOG, suppose $f_{i-1}$ and $f_i$ are honest.) Hence,
a benign link $l_i$ is falsely accused, and the false positive rate $P_{fp}$ is:

(B.4)  $P_{fp} := \sum_{i=m+2}^{d} P(C^{bad}_{m+1} < T_{in}, C^{bad}_i \geq T_{in} \mid l_i \text{ benign}) \leq d \cdot P(C^{bad}_{m+1} < C^{bad}_{m+2})$.

The actual $C^{bad}_{m+1}$ and $C^{bad}_{m+2}$ values can be represented by:

(B.5)  $C^{bad}_{m+1} = (q - \epsilon_1) \cdot y$,  $C^{bad}_{m+2} = (q(1 - q) + \epsilon_2) \cdot y$.

If we can bound

(B.6)  $\epsilon_1 = \epsilon_2 = \epsilon < \frac{q^2}{2}$,

then we can guarantee that $C^{bad}_{m+1} > C^{bad}_{m+2}$. Therefore, we have:

(B.7)  $\frac{P_{fp}}{d} \leq 1 - P(\epsilon < \frac{q^2}{2}) = P(\epsilon \geq \frac{q^2}{2}) \leq 2e^{-2y(\frac{q^2}{2})^2}$.

Note that in (B.7), we leverage Hoeffding’s inequality and the fact $y \geq T_{in}$ in the false positive
cases.

To achieve the desired upper bound $P_{fp} \leq \delta$, we set the threshold $T_{in}$ such that

(B.8)  $2e^{-2T_{in}(\frac{q^2}{2})^2} = \frac{\delta}{d}$.

Solving for $T_{in}$ gives

(B.9)  $T_{in} = \frac{2 \ln(\frac{2d}{\delta})}{q^4}$.
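
As a quick numeric companion, the Python sketch below evaluates the injection threshold $T_{in} = 2\ln(2d/\delta)/q^4$. It takes $q = 1 - 2^{-k}$, i.e., the probability that a randomly guessed $k$-bit MAC fails verification, which is our reading of the per-router detection probability. The function names and the choice $k = 2$ are hypothetical.

```python
import math


def mac_detection_probability(k):
    """Probability that a randomly guessed k-bit MAC fails verification."""
    return 1.0 - 2.0 ** (-k)


def injection_threshold(k, d, delta):
    """T_in = 2 * ln(2d / delta) / q^4."""
    q = mac_detection_probability(k)
    return 2.0 * math.log(2.0 * d / delta) / q ** 4


k, d, delta = 2, 5, 0.01
q = mac_detection_probability(k)
t_in = injection_threshold(k, d, delta)
per_link_bound = 2.0 * math.exp(-2.0 * t_in * (q * q / 2.0) ** 2)
print(q, t_in, per_link_bound)
```

Plugging $T_{in}$ back into $2e^{-2T_{in}(q^2/2)^2}$ returns $\delta/d$, so summing over at most $d$ links keeps the overall false positive rate within $\delta$. Longer MACs push $q$ towards 1 and therefore lower the threshold.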


B.2  Proof  of  Lemma  14 


Drop Detection Threshold and Detection Space. False positives arise when the observed
drop rate of a benign link $l_i$, denoted by $\rho^*_i$, exceeds the drop detection threshold $T_{dr}$. To bound
the total false positive rate below $\delta$, it is sufficient to ensure that each $\rho^*_i$ may exceed $T_{dr}$ with
a probability $\delta_i = \frac{\delta}{d}$ (since we need to ensure the overall false positive rate $\sum_i \delta_i \leq \delta$), i.e.,
$P(\rho^*_i > T_{dr}) < \frac{\delta}{d}$, which is equivalent to:

(B.10)  $P(\rho^*_i - \rho > T_{dr} - \rho) < \frac{\delta}{d}$.

By using Hoeffding’s inequality, we have:

(B.11)  $P(\rho^*_i - \rho > T_{dr} - \rho) < 2e^{-2C^{good}_i (T_{dr} - \rho)^2}  \Rightarrow  C^{good}_i \geq \frac{\ln(\frac{2d}{\delta})}{2(T_{dr} - \rho)^2}$.

Recall that the check-dropping procedure will detect the malicious link with excessive drop rate
closest to the source, denoted by $l_m$. So we need to guarantee $C^{good}_i \geq \frac{\ln(\frac{2d}{\delta})}{2(T_{dr} - \rho)^2}$ for any $i \leq m$.
Since we also have

(B.12)  $C^{good}_i \geq N(1 - T_{dr})^d$ for $i \leq m$,

we get:

(B.13)  $N = \frac{\ln(\frac{2d}{\delta})}{2(T_{dr} - \rho)^2 (1 - T_{dr})^d}$.

Analogously, we can also calculate the false negative rate, which yields the same result.
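
The expression $N = \ln(2d/\delta) / (2(T_{dr} - \rho)^2 (1 - T_{dr})^d)$ can be evaluated directly. The Python sketch below does so for $d = 5$, $\delta = 0.01$, $\rho = 0.005$ and $T_{dr} = 0.01$; the function names are hypothetical, and the result is the analytical value of the formula, not a measured detection delay.

```python
import math


def required_good_count(t_dr, rho, d, delta):
    """Per-link count needed by Hoeffding: ln(2d/delta) / (2 (T_dr - rho)^2)."""
    return math.log(2.0 * d / delta) / (2.0 * (t_dr - rho) ** 2)


def packets_per_epoch(t_dr, rho, d, delta):
    """N = ln(2d/delta) / (2 (T_dr - rho)^2 (1 - T_dr)^d)."""
    return required_good_count(t_dr, rho, d, delta) / (1.0 - t_dr) ** d


t_dr, rho, d, delta = 0.01, 0.005, 5, 0.01
n = packets_per_epoch(t_dr, rho, d, delta)
print(math.ceil(n))
```

The dominant factor is the gap $T_{dr} - \rho$: halving the gap quadruples $N$, whereas the path length $d$ only enters logarithmically and through $(1 - T_{dr})^d$.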

Malicious Dropping Bound. Suppose a malicious node $f_m$ closest to the source receives $C^{recv}_m$
data packets, but claims that it receives $C^{good}_m$ data packets, and drops an $x$ fraction of the received
$C^{recv}_m$ data packets on $l_{m+1}$. We first have the following facts:

(B.14)  $C^{recv}_m \leq C^{good}_{m-1}$,  $C^{good}_{m+1} = (1 - x) C^{recv}_m + \beta$.

To keep both of its incident links undetected, $f_m$ must manage to satisfy:

(B.15)  $\frac{C^{good}_m}{C^{good}_{m-1}} \geq 1 - T_{dr}$,  $\frac{C^{good}_{m+1}}{C^{good}_m} \geq 1 - T_{dr}$,

which yields

(B.16)  $C^{good}_{m+1} \geq (1 - T_{dr})^2 C^{good}_{m-1} \geq (1 - T_{dr})^d N$.

Solving (B.14), (B.15) and (B.16), we have

(B.17)  $x \leq 1 - (1 - T_{dr})^2 + \frac{\beta}{N(1 - T_{dr})^d} = \alpha$.

B.3  Proof of Theorem 15


$(\alpha, \beta)_\delta$-Statistical Security follows directly from Lemma 13 and Lemma 14. In the following, we
prove $(\Omega, \theta)$-Guaranteed Forwarding Correctness.

Given $N$ and $\delta$, we can set the drop detection threshold $T_{dr}$ from Lemma 14 and the injection
bound $\beta$ from Lemma 13. Let $\eta_{fake}$ denote the fake data packets the destination has received but
not detected yet, and $\eta_{leg}$ denote the legitimate data packets the destination has received out of $N$
data packets from the source. Then we have:

(B.18)  $\theta = \frac{\eta_{leg}}{N} = \frac{C^{good}_{d+1} - \eta_{fake}}{N}$.

When no fault is detected in the identify stage, the following holds:

(B.19)  $C^{good}_{d+1} \geq (1 - T_{dr})^d N$,  $\eta_{fake} \leq \beta$.

By (B.18) and (B.19), we have

(B.20)  $\theta \geq (1 - T_{dr})^d - \frac{\beta}{N}$.

Finally, we can integrate ShortMAC with routing as follows. The control plane first provides
a routing path $p$ for the source $S$, and then avoids faulty links using feedback (fault localization
results) from the data plane. In this way, ShortMAC enables the source to identify the malicious
links that reside in previously explored paths. In a network with $\Omega$ malicious links, the source can
bypass at least one of the malicious links after each epoch until a working path is found, resulting
in an exploration of at most $\Omega$ epochs to find a working path.


Appendix C


Proof  for  DynaFL 

C.1  Proof of Property 2 for Sketch

A sketch function $\mathcal{F}$ over a set of elements $S = \{p_1, p_2, \ldots, p_n\}$ can be implemented in a “streaming”
mode using a hash function $h$ [36], where:

(C.1)  $h(p_i) \rightarrow v_i$

and $v_i$ denotes a vector. More specifically:

(C.2)  $\mathcal{F}(S) = \mathcal{F}(\{p_1, p_2, \ldots, p_n\}) = h(p_1) + h(p_2) + \cdots + h(p_n)$.

Hence, given two packet streams $S = \{p_1, p_2, \ldots, p_n\}$ and $S' = \{p'_1, p'_2, \ldots, p'_{n'}\}$, we have:

(C.3)  $\mathcal{F}(S \cup S') = \mathcal{F}(\{p_1, \ldots, p_n, p'_1, \ldots, p'_{n'}\}) = h(p_1) + \cdots + h(p_n) + h(p'_1) + \cdots + h(p'_{n'})$

and:

(C.4)  $\mathcal{F}(S) + \mathcal{F}(S') = \mathcal{F}(\{p_1, \ldots, p_n\}) + \mathcal{F}(\{p'_1, \ldots, p'_{n'}\}) = h(p_1) + \cdots + h(p_n) + h(p'_1) + \cdots + h(p'_{n'})$.

From Equations C.3 and C.4 we can see that when $\mathcal{F}(S) \cup \mathcal{F}(S')$ is defined as $\mathcal{F}(S) + \mathcal{F}(S')$,
we have $\mathcal{F}(S \cup S') = \mathcal{F}(S) \cup \mathcal{F}(S')$, thus proving Property 2 for Sketch.
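
The additive structure behind Property 2 can be demonstrated with a toy linear sketch. In the Python sketch below, the hash $h$ maps each packet to a small vector of $\pm 1$ entries derived from SHA-256; this is only a stand-in chosen for illustration and is not the hash construction of [36]. The function names and sample packets are hypothetical.

```python
import hashlib


def h(packet, width=8):
    """Map a packet (bytes) to a vector of +1/-1 entries; an illustrative stand-in hash."""
    digest = hashlib.sha256(packet).digest()
    return [1 if digest[i] & 1 else -1 for i in range(width)]


def sketch(packets, width=8):
    """Streaming sketch: add h(p) to a running vector for every packet p."""
    total = [0] * width
    for packet in packets:
        total = [a + b for a, b in zip(total, h(packet, width))]
    return total


def combine(sketch_a, sketch_b):
    """The 'union' of two sketches, defined as their element-wise sum."""
    return [a + b for a, b in zip(sketch_a, sketch_b)]


stream_a = [b"pkt-1", b"pkt-2", b"pkt-3"]
stream_b = [b"pkt-4", b"pkt-5"]
merged = sketch(stream_a + stream_b)
summed = combine(sketch(stream_a), sketch(stream_b))
print(merged == summed)
```

Because the sketch is a sum of per-packet vectors, sketching the concatenated stream and adding the two separate sketches give the same vector, regardless of the order in which packets arrive.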